
Optimal No-regret Learning in Repeated First-price Auctions

Published by: Yanjun Han
Publication date: 2020
Research field: Informatics Engineering
Research language: English





We study online learning in repeated first-price auctions with censored feedback, where a bidder, only observing the winning bid at the end of each auction, learns to adaptively bid in order to maximize her cumulative payoff. To achieve this goal, the bidder faces a challenging dilemma: if she wins the bid (the only way to achieve positive payoffs), then she is not able to observe the highest bid of the other bidders, which we assume is drawn i.i.d. from an unknown distribution. This dilemma, despite being reminiscent of the exploration-exploitation trade-off in contextual bandits, cannot directly be addressed by the existing UCB or Thompson sampling algorithms in that literature, mainly because, contrary to the standard bandit setting, when a positive reward is obtained here, nothing about the environment can be learned. In this paper, by exploiting the structural properties of first-price auctions, we develop the first learning algorithm that achieves an $O(\sqrt{T}\log^2 T)$ regret bound when the bidder's private values are stochastically generated. We do so by providing an algorithm on a general class of problems, which we call monotone group contextual bandits, where the same regret bound is established under stochastically generated contexts. Further, by a novel lower bound argument, we characterize an $\Omega(T^{2/3})$ lower bound for the case where the contexts are adversarially generated, thus highlighting the impact of the context generation mechanism on the fundamental learning limit. Despite this, we further exploit the structure of first-price auctions and develop a learning algorithm that operates sample-efficiently (and computationally efficiently) in the presence of adversarially generated private values. We establish an $O(\sqrt{T}\log^3 T)$ regret bound for this algorithm, hence providing a complete characterization of optimal learning guarantees for this problem.
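The censored feedback described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names, the tie-breaking rule (a tie counts as a win), the Uniform(0, 1) distributions, and the fixed bid-shading rule are all assumptions made only for illustration.

```python
import random


def auction_round(value, bid, highest_other_bid):
    """Play one first-price auction round with censored feedback.

    Returns (payoff, observed_winning_bid, others_bid_revealed).
    Ties are resolved in the bidder's favor (an assumption).
    """
    if bid >= highest_other_bid:
        # The bidder wins and pays her own bid. The winning bid she observes
        # is her own, so she only learns that the others' bid was <= bid.
        return value - bid, bid, False
    # The bidder loses and earns nothing, but the winning bid she observes
    # is exactly the highest bid of the other bidders.
    return 0.0, highest_other_bid, True


def simulate(num_rounds=5, seed=0):
    """Run a few rounds with a hypothetical fixed shading rule."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_rounds):
        value = rng.random()              # private value, assumed Uniform(0, 1)
        bid = 0.5 * value                 # hypothetical rule: bid half the value
        highest_other_bid = rng.random()  # assumed i.i.d. Uniform(0, 1)
        history.append(auction_round(value, bid, highest_other_bid))
    return history
```

In this sketch, every round with a positive payoff has `others_bid_revealed` equal to `False`, which is the dilemma the abstract describes: rounds that pay off reveal only an upper bound on the competing bid.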




Read also

In this paper, we investigate the problem of how to bid in repeated contextual first-price auctions. We consider a single bidder (learner) who repeatedly bids in first-price auctions: at each time $t$, the learner observes a context $x_t \in \mathbb{R}^d$ and decides the bid based on historical information and $x_t$. We assume a structured linear model of the maximum bid of all the others, $m_t = \alpha_0 \cdot x_t + z_t$, where $\alpha_0 \in \mathbb{R}^d$ is unknown to the learner and $z_t$ is randomly sampled from a noise distribution $\mathcal{F}$ with log-concave density function $f$. We consider both binary feedback (the learner can only observe whether she wins or not) and full information feedback (the learner can observe $m_t$) at the end of each time $t$. For binary feedback, when the noise distribution $\mathcal{F}$ is known, we propose a bidding algorithm that uses maximum likelihood estimation (MLE) to achieve regret of at most $\widetilde{O}(\sqrt{\log(d) T})$. Moreover, we generalize this algorithm to the binary feedback setting where the noise distribution is unknown but belongs to a parametrized family of distributions. For full information feedback with unknown noise distribution, we provide an algorithm that achieves regret of at most $\widetilde{O}(\sqrt{dT})$. Our approach combines an estimator for log-concave density functions with the MLE method to learn the noise distribution $\mathcal{F}$ and the linear weight $\alpha_0$ simultaneously. We also provide a lower bound showing that any bidding policy in a broad class must incur regret of at least $\Omega(\sqrt{T})$, even when the learner receives full information feedback and $\mathcal{F}$ is known.
The notion of policy regret in online learning is a well-defined performance measure for the common scenario of adaptive adversaries, which more traditional quantities such as external regret do not take into account. We revisit the notion of policy regret and first show that there are online learning settings in which policy regret and external regret are incompatible: any sequence of play that achieves a favorable regret with respect to one definition must do poorly with respect to the other. We then focus on the game-theoretic setting where the adversary is a self-interested agent. In that setting, we show that external regret and policy regret are not in conflict and, in fact, that a wide class of algorithms can ensure a favorable regret with respect to both definitions, so long as the adversary is also using such an algorithm. We also show that the sequence of play of no-policy-regret algorithms converges to a policy equilibrium, a new notion of equilibrium that we introduce. Relating this back to external regret, we show that coarse correlated equilibria, which no-external-regret players converge to, are a strict subset of policy equilibria. Thus, in game-theoretic settings, every sequence of play with no external regret also admits no policy regret, but the converse does not hold.
Consider a player that in each round $t$ out of $T$ rounds chooses an action and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that even if the players' algorithms lose their no-regret property due to overly large delays, the expected discounted ergodic distribution of play converges to the set of coarse correlated equilibria (CCE) if the algorithms have no discounted-regret. For a zero-sum game, we show that no discounted-regret is sufficient for the discounted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves a regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves a regret of $O\left(\sqrt{\ln K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that provably retains the regret bound that holds when $D$ and $T$ are known. Using these bounds, we show that EXP3 and FKM have no discounted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, the CCE of a finite or convex unknown game can be approximated via simulation even when only delayed bandit feedback is available.
We present a general framework for proving polynomial sample complexity bounds for the problem of learning from samples the best auction in a class of simple auctions. Our framework captures all of the most prominent examples of simple auctions, including anonymous and non-anonymous item and bundle pricings, with either a single or multiple buyers. The technique we propose is to break the analysis of auctions into two natural pieces. First, one shows that the set of allocation rules has a large amount of structure; second, fixing an allocation on a sample, one shows that the set of auctions agreeing with this allocation on that sample has revenue functions with low dimensionality. Our results effectively imply that whenever it is possible to compute a near-optimal simple auction with a known prior, it is also possible to compute such an auction with an unknown prior (given a polynomial number of samples).
We study a family of convex polytopes, called SIM-bodies, which were introduced by Giannakopoulos and Koutsoupias (2018) to analyze so-called Straight-Jacket Auctions. First, we show that the SIM-bodies belong to the class of generalized permutahedra. Second, we prove an optimality result for the Straight-Jacket Auctions among certain deterministic auctions. Third, we employ computer algebra methods and mathematical software to explicitly determine optimal prices and revenues.
