
Stateless Reinforcement Learning for Multi-Agent Systems: the Case of Spectrum Allocation in Dynamic Channel Bonding WLANs

Published by: Dr. Sergio Barrachina-Muñoz
Publication date: 2021
Research field: Informatics engineering
Paper language: English





Spectrum allocation in the form of primary channel and bandwidth selection is a key factor for dynamic channel bonding (DCB) wireless local area networks (WLANs). To cope with varying environments, where networks change their configurations on their own, the wireless community is looking towards solutions aided by machine learning (ML), and especially reinforcement learning (RL) given its trial-and-error approach. However, strong assumptions are normally made to let complex RL models converge to near-optimal solutions. Our goal with this paper is two-fold: to justify in a comprehensible way why RL should be the approach for wireless network problems like decentralized spectrum allocation, and to call into question whether the use of complex RL algorithms helps the quest for rapid learning in realistic scenarios. We show that stateless RL in the form of lightweight multi-armed bandits (MABs) is an efficient solution for rapid adaptation, avoiding the definition of extensive or meaningless RL states.
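The stateless-MAB idea in the abstract can be sketched with a minimal epsilon-greedy bandit. This is an illustrative toy, not the paper's implementation: each arm stands for one hypothetical (primary channel, bandwidth) configuration, and the reward is a made-up throughput sample.

```python
import random


class EpsilonGreedyMAB:
    """Stateless epsilon-greedy bandit: no RL state is defined,
    only a running mean reward per arm (configuration)."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        # Explore a random arm with probability epsilon, else exploit.
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # Incremental mean update: the only memory the agent keeps.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Toy loop with invented mean throughputs; arm 2 is the best
# configuration, so the bandit should converge to it.
random.seed(7)
bandit = EpsilonGreedyMAB(n_arms=4)
mean_throughput = [0.2, 0.4, 0.9, 0.5]  # placeholder numbers
for _ in range(2000):
    arm = bandit.select()
    bandit.update(arm, mean_throughput[arm] + random.uniform(-0.05, 0.05))
best = max(range(4), key=bandit.values.__getitem__)
```

The point of the sketch is that per-arm statistics replace an RL state table entirely, which is why such bandits adapt quickly in scenarios where defining meaningful states is hard.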



Read also

145 - Xiang Tan, Li Zhou, Haijun Wang (2021)
With the development of 5G and the Internet of Things, large numbers of wireless devices need to share the limited spectrum resources. Dynamic spectrum access (DSA) is a promising paradigm to remedy the inefficient spectrum utilization brought upon by the historical command-and-control approach to spectrum allocation. In this paper, we investigate the distributed multi-user DSA problem in a typical multi-channel cognitive radio network. The problem is formulated as a decentralized partially observable Markov decision process (Dec-POMDP), and we propose a centralized off-line training and distributed on-line execution framework based on cooperative multi-agent reinforcement learning (MARL). We employ the deep recurrent Q-network (DRQN) to address the partial observability of the state for each cognitive user. The ultimate goal is to learn a cooperative strategy that maximizes the sum throughput of the cognitive radio network in a distributed fashion, without coordination information exchange between cognitive users. Finally, we validate the proposed algorithm in various settings through extensive experiments. The simulation results show that the proposed algorithm converges quickly and achieves almost optimal performance.
84 - Weiheng Jiang, Wanxin Yu (2021)
Designing clustered unmanned aerial vehicle (UAV) communication networks based on cognitive radio (CR) and reinforcement learning can significantly improve the intelligence level of clustered UAV communication networks and the robustness of the system in a time-varying environment. Within CR, designing smarter systems for spectrum sensing and access is a key research issue. We therefore focus on dynamic cooperative spectrum sensing and channel access in clustered cognitive UAV (CUAV) communication networks. Due to the lack of prior statistical information on the primary user (PU) channel occupancy state, we propose to use multi-agent reinforcement learning (MARL) to model the CUAV spectrum competition and cooperative decision-making problem in this dynamic scenario, and a return function based on a weighted compound of sensing-transmission cost and utility is introduced to characterize the real-time rewards of the multi-agent game. On this basis, a time-slot multi-round revisit exhaustive search algorithm based on a virtual controller (VC-EXH), a Q-learning algorithm based on independent learners (IL-Q), and a deep Q-learning algorithm based on independent learners (IL-DQN) are respectively proposed. Further, the information exchange overhead, execution complexity, and convergence of the three algorithms are briefly analyzed. Numerical simulations show that all three algorithms converge quickly, significantly improve system performance, and increase the utilization of idle spectrum resources.
Unmanned aerial vehicles (UAVs) are capable of serving as aerial base stations (BSs) for providing both cost-effective and on-demand wireless communications. This article investigates dynamic resource allocation in multi-UAV-enabled communication networks with the goal of maximizing long-term rewards. In particular, each UAV communicates with a ground user by automatically selecting its communicating users, power levels, and subchannels without any information exchange among UAVs. To model the uncertainty of environments, we formulate the long-term resource allocation problem as a stochastic game for maximizing the expected rewards, where each UAV becomes a learning agent and each resource allocation solution corresponds to an action taken by the UAVs. We then develop a multi-agent reinforcement learning (MARL) framework in which each agent discovers its best strategy according to its local observations. More specifically, we propose an agent-independent method, in which all agents conduct a decision algorithm independently but share a common structure based on Q-learning. Finally, simulation results reveal that: 1) appropriate parameters for exploitation and exploration are capable of enhancing the performance of the proposed MARL-based resource allocation algorithm; 2) the proposed MARL algorithm provides acceptable performance compared to the case with complete information exchange among UAVs. By doing so, it strikes a good tradeoff between performance gains and information exchange overheads.
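The "agent-independent" structure described above (independent learners sharing a common Q-learning core) can be illustrated with a small tabular sketch. The environment, observations, and reward numbers below are invented for demonstration and are not taken from the paper:

```python
import random
from collections import defaultdict


class IndependentQAgent:
    """Tabular Q-learner; each agent runs its own copy and updates
    from local observations only, with no inter-agent exchange."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # (observation, action) -> value

    def act(self, obs):
        # Epsilon-greedy action selection on local observation only.
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.q[(obs, a)])

    def learn(self, obs, action, reward, next_obs):
        # Standard one-step Q-learning temporal-difference update.
        best_next = max(self.q[(next_obs, a)] for a in range(self.n_actions))
        td_error = reward + self.gamma * best_next - self.q[(obs, action)]
        self.q[(obs, action)] += self.alpha * td_error


# Toy two-agent, two-subchannel problem: colliding on the same
# subchannel yields reward 0, splitting yields 1 for each agent.
random.seed(1)
agents = [IndependentQAgent(n_actions=2) for _ in range(2)]
obs = "idle"  # single trivial observation, kept simple on purpose
for _ in range(3000):
    choices = [ag.act(obs) for ag in agents]
    rewards = [0.0, 0.0] if choices[0] == choices[1] else [1.0, 1.0]
    for ag, a, r in zip(agents, choices, rewards):
        ag.learn(obs, a, r, obs)
greedy = [max(range(2), key=lambda a: ag.q[(obs, a)]) for ag in agents]
```

Despite never exchanging information, the two learners break symmetry through exploration and settle on different subchannels, which is the tradeoff between coordination overhead and performance that the abstract highlights.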
In reinforcement learning, agents learn by performing actions and observing their outcomes. Sometimes, it is desirable for a human operator to "interrupt" an agent in order to prevent dangerous situations from happening. Yet, as part of their learning process, agents may link these interruptions, which impact their reward, to specific states and deliberately avoid them. The situation is particularly challenging in a multi-agent context because agents might not only learn from their own past interruptions, but also from those of other agents. Orseau and Armstrong defined "safe interruptibility" for one learner, but their work does not naturally extend to multi-agent systems. This paper introduces "dynamic safe interruptibility", an alternative definition more suited to decentralized learning problems, and studies this notion in two learning frameworks: joint action learners and independent learners. We give realistic sufficient conditions on the learning algorithm to enable dynamic safe interruptibility in the case of joint action learners, yet show that these conditions are not sufficient for independent learners. We show, however, that if agents can detect interruptions, it is possible to prune the observations to ensure dynamic safe interruptibility even for independent learners.
In this paper, the problem of dynamic spectrum sensing and aggregation is investigated in a wireless network containing N correlated channels, where these channels are occupied or vacant following an unknown joint 2-state Markov model. At each time slot, a single cognitive user with a certain bandwidth requirement either stays idle or selects a segment comprising C (C < N) contiguous channels to sense. Then, the vacant channels in the selected segment are aggregated to satisfy the user requirement. The user receives a binary feedback signal indicating whether the transmission is successful or not (i.e., an ACK signal) after each transmission, and makes the next decision based on the sensed channel states. Here, we aim to find a policy that maximizes the number of successful transmissions without interrupting the primary users (PUs). The problem can be considered a partially observable Markov decision process (POMDP), since the system environment is not fully observed. We implement a Deep Q-Network (DQN) to address the challenges of unknown system dynamics and computational expense. The performance of DQN, Q-learning, and the Improvident Policy with known system dynamics is evaluated through simulations. The simulation results show that DQN can achieve near-optimal performance across different system scenarios based only on partial observations and ACK signals.