No Arabic abstract
We motivate and propose a new model for non-cooperative Markov game which considers the interactions of risk-aware players. This model characterizes the time-consistent dynamic risk from both stochastic state transitions (inherent to the game) and randomized mixed strategies (due to all other players). An appropriate risk-aware equilibrium concept is proposed and the existence of such equilibria is demonstrated in stationary strategies by an application of Kakutanis fixed point theorem. We further propose a simulation-based Q-learning type algorithm for risk-aware equilibrium computation. This algorithm works with a special form of minimax risk measures which can naturally be written as saddle-point stochastic optimization problems, and covers many widely investigated risk measures. Finally, the almost sure convergence of this simulation-based algorithm to an equilibrium is demonstrated under some mild conditions. Our numerical experiments on a two player queuing game validate the properties of our model and algorithm, and demonstrate their worth and applicability in real life competitive decision-making.
We consider a new setting of facility location games with ordinal preferences. In such a setting, we have a set of agents and a set of facilities. Each agent is located on a line and has an ordinal preference over the facilities. Our goal is to design strategyproof mechanisms that elicit truthful information (preferences and/or locations) from the agents and locate the facilities to minimize both maximum and total cost objectives as well as to maximize both minimum and total utility objectives. For the four possible objectives, we consider the 2-facility settings in which only preferences are private, or locations are private. For each possible combination of the objectives and settings, we provide lower and upper bounds on the approximation ratios of strategyproof mechanisms, which are asymptotically tight up to a constant. Finally, we discuss the generalization of our results beyond two facilities and when the agents can misreport both locations and preferences.
In an adversarial environment, a hostile player performing a task may behave like a non-hostile one in order not to reveal its identity to an opponent. To model such a scenario, we define identity concealment games: zero-sum stochastic reachability games with a zero-sum objective of identity concealment. To measure the identity concealment of the player, we introduce the notion of an average player. The average players policy represents the expected behavior of a non-hostile player. We show that there exists an equilibrium policy pair for every identity concealment game and give the optimality equations to synthesize an equilibrium policy pair. If the players opponent follows a non-equilibrium policy, the player can hide its identity better. For this reason, we study how the hostile player may learn the opponents policy. Since learning via exploration policies would quickly reveal the hostile players identity to the opponent, we consider the problem of learning a near-optimal policy for the hostile player using the game runs collected under the average players policy. Consequently, we propose an algorithm that provably learns a near-optimal policy and give an upper bound on the number of sample runs to be collected.
In many sequential decision-making problems one is interested in minimizing an expected cumulative cost while taking into account emph{risk}, i.e., increased awareness of events of small probability and high consequences. Accordingly, the objective of this paper is to present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented via a chance constraint or a constraint on the conditional value-at-risk (CVaR) of the cumulative cost. We collectively refer to such problems as percentile risk-constrained MDPs. Specifically, we first derive a formula for computing the gradient of the Lagrangian function for percentile risk-constrained MDPs. Then, we devise policy gradient and actor-critic algorithms that (1) estimate such gradient, (2) update the policy in the descent direction, and (3) update the Lagrange multiplier in the ascent direction. For these algorithms we prove convergence to locally optimal policies. Finally, we demonstrate the effectiveness of our algorithms in an optimal stopping problem and an online marketing application.
This paper examines the convergence of no-regret learning in Cournot games with continuous actions. Cournot games are the essential model for many socio-economic systems, where players compete by strategically setting their output quantity. We assume that players do not have full information of the game and thus cannot pre-compute a Nash equilibrium. Two types of feedback are considered: one is bandit feedback and the other is gradient feedback. To study the convergence of the induced sequence of play, we introduce the notion of convergence in measure, and show that the players actual sequence of action converges to the unique Nash equilibrium. In addition, our results naturally extend the no-regret learning algorithms time-average regret bounds to obtain the final-iteration convergence rates. Together, our work presents significantly sharper convergence results for learning in games without strong assumptions on game property (e.g., monotonicity) and shows how exploiting the game information feedback can influence the convergence rates.
We initiate the work on maximin share (MMS) fair allocation of m indivisible chores to n agents using only their ordinal preferences, from both algorithmic and mechanism design perspectives. The previous best-known approximation is 2-1/n by Aziz et al. [IJCAI 2017]. We improve this result by giving a simple deterministic 5/3-approximation algorithm that determines an allocation sequence of agents, according to which items are allocated one by one. By a tighter analysis, we show that for n=2,3, our algorithm achieves better approximation ratios, and is actually optimal. We also consider the setting with strategic agents, where agents may misreport their preferences to manipulate the outcome. We first provide a O(log (m/n))-approximation consecutive picking algorithm, and then improve the approximation ratio to O(sqrt{log n}) by a randomized algorithm. Our results uncover some interesting contrasts between the approximation ratios achieved for chores versus goods.