ترغب بنشر مسار تعليمي؟ اضغط هنا

Bayesian Exploration: Incentivizing Exploration in Bayesian Games

561   0   0.0 ( 0 )
 نشر من قبل Aleksandrs Slivkins
 تاريخ النشر 2016
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

We consider a ubiquitous scenario in the Internet economy when individual decision-makers (henceforth, agents) both produce and consume information as they make strategic choices in an uncertain environment. This creates a three-way tradeoff between exploration (trying out insufficiently explored alternatives to help others in the future), exploitation (making optimal decisions given the information discovered by other agents), and incentives of the agents (who are myopically interested in exploitation, while preferring the others to explore). We posit a principal who controls the flow of information from agents that came before, and strives to coordinate the agents towards a socially optimal balance between exploration and exploitation, not using any monetary transfers. The goal is to design a recommendation policy for the principal which respects agents incentives and minimizes a suitable notion of regret. We extend prior work in this direction to allow the agents to interact with one another in a shared environment: at each time step, multiple agents arrive to play a Bayesian game, receive recommendations, choose their actions, receive their payoffs, and then leave the game forever. The agents now face two sources of uncertainty: the actions of the other agents and the parameters of the uncertain game environment. Our main contribution is to show that the principal can achieve constant regret when the utilities are deterministic (where the constant depends on the prior distribution, but not on the time horizon), and logarithmic regret when the utilities are stochastic. As a key technical tool, we introduce the concept of explorable actions, the actions which some incentive-compatible policy can recommend with non-zero probability. We show how the principal can identify (and explore) all explorable actions, and use the revealed information to perform optimally.



قيم البحث

اقرأ أيضاً

We consider incentivized exploration: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents, and the algorithm can only issue recommendations. The algorithm controls the flow of information, and the informa tion asymmetry can incentivize the agents to explore. Prior work achieves optimal regret rates up to multiplicative factors that become arbitrarily large depending on the Bayesian priors, and scale exponentially in the number of arms. A more basic problem of sampling each arm once runs into similar factors. We focus on the price of incentives: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility. We prove that Thompson Sampling, a standard bandit algorithm, is incentive-compatible if initialized with sufficiently many data points. The performance loss due to incentives is therefore limited to the initial rounds when these data points are collected. The problem is largely reduced to that of sample complexity: how many rounds are needed? We address this question, providing matching upper and lower bounds and instantiating them in various corollaries. Typically, the optimal sample complexity is polynomial in the number of arms and exponential in the strength of beliefs.
Individual decision-makers consume information revealed by the previous decision makers, and produce information that may help in future decisions. This phenomenon is common in a wide range of scenarios in the Internet economy, as well as in other do mains such as medical decisions. Each decision-maker would individually prefer to exploit: select an action with the highest expected reward given her current information. At the same time, each decision-maker would prefer previous decision-makers to explore, producing information about the rewards of various actions. A social planner, by means of carefully designed information disclosure, can incentivize the agents to balance the exploration and exploitation so as to maximize social welfare. We formulate this problem as a multi-armed bandit problem (and various generalizations thereof) under incentive-compatibility constraints induced by the agents Bayesian priors. We design an incentive-compatible bandit algorithm for the social planner whose regret is asymptotically optimal among all bandit algorithms (incentive-compatible or not). Further, we provide a black-box reduction from an arbitrary multi-arm bandit algorithm to an incentive-compatible one, with only a constant multiplicative increase in regret. This reduction works for very general bandit setting that incorporate contexts and arbitrary auxiliary feedback.
We study network coordination problems, as captured by the setting of generalized network design (Emek et al., STOC 2018), in the face of uncertainty resulting from partial information that the network users hold regarding the actions of their peers. This uncertainty is formalized using Alon et al.s Bayesian ignorance framework (TCS 2012). While the approach of Alon et al. is purely combinatorial, the current paper takes into account computational considerations: Our main technical contribution is the development of (strongly) polynomial time algorithms for local decision making in the face of Bayesian uncertainty.
We study the problem of incentivizing exploration for myopic users in linear bandits, where the users tend to exploit arm with the highest predicted reward instead of exploring. In order to maximize the long-term reward, the system offers compensatio n to incentivize the users to pull the exploratory arms, with the goal of balancing the trade-off among exploitation, exploration and compensation. We consider a new and practically motivated setting where the context features observed by the user are more informative than those used by the system, e.g., features based on users private information are not accessible by the system. We propose a new method to incentivize exploration under such information gap, and prove that the method achieves both sublinear regret and sublinear compensation. We theoretical and empirically analyze the added compensation due to the information gap, compared with the case that the system has access to the same context features as the user, i.e., without information gap. We also provide a compensation lower bound of our problem.
Network congestion games are a well-understood model of multi-agent strategic interactions. Despite their ubiquitous applications, it is not clear whether it is possible to design information structures to ameliorate the overall experience of the net work users. We focus on Bayesian games with atomic players, where network vagaries are modeled via a (random) state of nature which determines the costs incurred by the players. A third-party entity---the sender---can observe the realized state of the network and exploit this additional information to send a signal to each player. A natural question is the following: is it possible for an informed sender to reduce the overall social cost via the strategic provision of information to players who update their beliefs rationally? The paper focuses on the problem of computing optimal ex ante persuasive signaling schemes, showing that symmetry is a crucial property for its solution. Indeed, we show that an optimal ex ante persuasive signaling scheme can be computed in polynomial time when players are symmetric and have affine cost functions. Moreover, the problem becomes NP-hard when players are asymmetric, even in non-Bayesian settings.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا