Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes

123 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Yangguang Shi

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Yuval Emek - Ron Lavi - Rad Niazadeh

علوم الكمبيوتر ونظرية الألعاب

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

In this paper, a rather general online problem called dynamic resource allocation with capacity constraints (DRACC) is introduced and studied in the realm of posted price mechanisms. This problem subsumes several applications of stateful pricing, including but not limited to posted prices for online job scheduling and matching over a dynamic bipartite graph. As the existing online learning techniques do not yield vanishing-regret mechanisms for this problem, we develop a novel online learning framework defined over deterministic Markov decision processes with dynamic state transition and reward functions. We then prove that if the Markov decision process is guaranteed to admit an oracle that can simulate any given policy from any initial state with bounded loss -- a condition that is satisfied in the DRACC problem -- then the online learning problem can be solved with vanishing regret. Our proof technique is based on a reduction to online learning with switching cost, in which an online decision maker incurs an extra cost every time she switches from one arm to another. We formally demonstrate this connection and further show how DRACC can be used in our proposed applications of stateful pricing.

قيم البحث

89 - Giulia Romano , Gianluca Tartaglia , Alberto Marchesi 2020

We study the problem of designing posted-price mechanisms in order to sell a single unit of a single item within a finite period of time. Motivated by real-world problems, such as, e.g., long-term rental of rooms and apartments, we assume that custom ers arrive online according to a Poisson process, and their valuations are drawn from an unknown distribution and discounted over time. We evaluate our mechanisms in terms of competitive ratio, measuring the worst-case ratio between their revenue and that of an optimal mechanism that knows the distribution of valuations. First, we focus on the identical valuation setting, where all the customers value the item for the same amount. In this setting, we provide a mechanism M_c that achieves the best possible competitive ratio, discussing its dependency on the parameters in the case of linear discount. Then, we switch to the random valuation setting. We show that, if we restrict the attention to distributions of valuations with a monotone hazard rate, then the competitive ratio of M_c is lower bounded by a strictly positive constant that does not depend on the distribution. Moreover, we provide another mechanism, called M_pc, which is defined by a piecewise constant pricing strategy and reaches performances comparable to those obtained with M_c. This mechanism is useful when the seller cannot change the posted price too often. Finally, we empirically evaluate the performances of our mechanisms in a number of experimental settings.

علوم الكمبيوتر ونظرية الألعاب

Limit Synchronization in Markov Decision Processes

123 - Laurent Doyen , Thierry Massart , Mahsa Shirmohammadi 2013

Markov decision processes (MDP) are finite-state systems with both strategic and probabilistic choices. After fixing a strategy, an MDP produces a sequence of probability distributions over states. The sequence is eventually synchronizing if the prob ability mass accumulates in a single state, possibly in the limit. Precisely, for 0 <= p <= 1 the sequence is p-synchronizing if a probability distribution in the sequence assigns probability at least p to some state, and we distinguish three synchronization modes: (i) sure winning if there exists a strategy that produces a 1-synchronizing sequence; (ii) almost-sure winning if there exists a strategy that produces a sequence that is, for all epsilon > 0, a (1-epsilon)-synchronizing sequence; (iii) limit-sure winning if for all epsilon > 0, there exists a strategy that produces a (1-epsilon)-synchronizing sequence. We consider the problem of deciding whether an MDP is sure, almost-sure, limit-sure winning, and we establish the decidability and optimal complexity for all modes, as well as the memory requirements for winning strategies. Our main contributions are as follows: (a) for each winning modes we present characterizations that give a PSPACE complexity for the decision problems, and we establish matching PSPACE lower bounds; (b) we show that for sure winning strategies, exponential memory is sufficient and may be necessary, and that in general infinite memory is necessary for almost-sure winning, and unbounded memory is necessary for limit-sure winning; (c) along with our results, we establish new complexity results for alternating finite automata over a one-letter alphabet.

علوم الكمبيوتر ونظرية الألعاب

Max-Plus Matching Pursuit for Deterministic Markov Decision Processes

114 - Francis Bach 2019

We consider deterministic Markov decision processes (MDPs) and apply max-plus algebra tools to approximate the value iteration algorithm by a smaller-dimensional iteration based on a representation on dictionaries of value functions. The setup natura lly leads to novel theoretical results which are simply formulated due to the max-plus algebra structure. For example, when considering a fixed (non adaptive) finite basis, the computational complexity of approximating the optimal value function is not directly related to the number of states, but to notions of covering numbers of the state space. In order to break the curse of dimensionality in factored state-spaces, we consider adaptive basis that can adapt to particular problems leading to an algorithm similar to matching pursuit from signal processing. They currently come with no theoretical guarantees but work empirically well on simple deterministic MDPs derived from low-dimensional continuous control problems. We focus primarily on deterministic MDPs but note that the framework can be applied to all MDPs by considering measure-based formulations.

التعلم الآلي التعلم الالي

Asymptotically Optimal Welfare of Posted Pricing for Multiple Items with MHR Distributions

84 - Alexander Braun , Matthias Buttkus , Thomas Kesselheim 2021

We consider the problem of posting prices for unit-demand buyers if all $n$ buyers have identically distributed valuations drawn from a distribution with monotone hazard rate. We show that even with multiple items asymptotically optimal welfare can b e guaranteed. Our main results apply to the case that either a buyers value for different items are independent or that they are perfectly correlated. We give mechanisms using dynamic prices that obtain a $1 - Theta left( frac{1}{log n}right)$-fraction of the optimal social welfare in expectation. Furthermore, we devise mechanisms that only use static item prices and are $1 - Theta left( frac{logloglog n}{log n}right)$-competitive compared to the optimal social welfare. As we show, both guarantees are asymptotically optimal, even for a single item and exponential distributions.

علوم الكمبيوتر ونظرية الألعاب بنى وهياكل البيانات والخوارزميات

Scalable methods for computing state similarity in deterministic Markov Decision Processes

160 - Pablo Samuel Castro 2019

We present new algorithms for computing and approximating bisimulation metrics in Markov Decision Processes (MDPs). Bisimulation metrics are an elegant formalism that capture behavioral equivalence between states and provide strong theoretical guaran tees on differences in optimal behaviour. Unfortunately, their computation is expensive and requires a tabular representation of the states, which has thus far rendered them impractical for large problems. In this paper we present a new version of the metric that is tied to a behavior policy in an MDP, along with an analysis of its theoretical properties. We then present two new algorithms for approximating bisimulation metrics in large, deterministic MDPs. The first does so via sampling and is guaranteed to converge to the true metric. The second is a differentiable loss which allows us to learn an approximation even for continuous state MDPs, which prior to this work had not been possible.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة سوهاج

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً