No Arabic abstract
In this work, we aim to create a completely online algorithmic framework for prediction with expert advice that is translation-free and scale-free of the expert losses. Our goal is to create a generalized algorithm that is suitable for use in a wide variety of applications. For this purpose, we study the expected regret of our algorithm against a generic competition class in the sequential prediction by expert advice problem, where the expected regret measures the difference between the losses of our prediction algorithm and the losses of the best expert selection strategy in the competition. We design our algorithm using the universal prediction perspective to compete against a specified class of expert selection strategies, which is not necessarily a fixed expert selection. The class of expert selection strategies that we want to compete against is purely determined by the specific application at hand and is left generic, which makes our generalized algorithm suitable for use in many different problems. We show that no preliminary knowledge about the loss sequence is required by our algorithm and its performance bounds, which are second order, expressed in terms of sums of squared losses. Our regret bounds are stable under arbitrary scalings and translations of the losses.
This work addresses the classic machine learning problem of online prediction with expert advice. We consider the finite-horizon version of this zero-sum, two-person game. Using verification arguments from optimal control theory, we view the task of finding better lower and upper bounds on the value of the game (regret) as the problem of finding better sub- and supersolutions of certain partial differential equations (PDEs). These sub- and supersolutions serve as the potentials for player and adversary strategies, which lead to the corresponding bounds. To get explicit bounds, we use closed-form solutions of specific PDEs. Our bounds hold for any given number of experts and horizon; in certain regimes (which we identify) they improve upon the previous state of the art. For two and three experts, our bounds provide the optimal leading order term.
This work addresses a classic problem of online prediction with expert advice. We assume an adversarial opponent, and we consider both the finite-horizon and random-stoppi
This work addresses the classic machine learning problem of online prediction with expert advice. A new potential-based framework for the fixed horizon version of this problem has been recently developed using verification arguments from optimal control theory. This paper extends this framework to the random (geometric) stopping version. To obtain explicit bounds, we construct potentials for the geometric version from potentials used for the fixed horizon version of the problem. This construction leads to new explicit lower and upper bounds associated with specific adversary and player strategies. While there are several known lower bounds in the fixed horizon setting, our lower bounds appear to be the first such results in the geometric stopping setting with an arbitrary number of experts. Our framework also leads in some cases to improved upper bounds. For two and three experts, our bounds are optimal to leading order.
We derive an online learning algorithm with improved regret guarantees for `easy loss sequences. We consider two types of `easiness: (a) stochastic loss sequences and (b) adversarial loss sequences with small effective range of the losses. While a number of algorithms have been proposed for exploiting small effective range in the full information setting, Gerchinovitz and Lattimore [2016] have shown the impossibility of regret scaling with the effective range of the losses in the bandit setting. We show that just one additional observation per round is sufficient to circumvent the impossibility result. The proposed Second Order Difference Adjustments (SODA) algorithm requires no prior knowledge of the effective range of the losses, $varepsilon$, and achieves an $O(varepsilon sqrt{KT ln K}) + tilde{O}(varepsilon K sqrt[4]{T})$ expected regret guarantee, where $T$ is the time horizon and $K$ is the number of actions. The scaling with the effective loss range is achieved under significantly weaker assumptions than those made by Cesa-Bianchi and Shamir [2018] in an earlier attempt to circumvent the impossibility result. We also provide a regret lower bound of $Omega(varepsilonsqrt{T K})$, which almost matches the upper bound. In addition, we show that in the stochastic setting SODA achieves an $Oleft(sum_{a:Delta_a>0} frac{K^3 varepsilon^2}{Delta_a}right)$ pseudo-regret bound that holds simultaneously with the adversarial regret guarantee. In other words, SODA is safe against an unrestricted oblivious adversary and provides improved regret guarantees for at least two different types of `easiness simultaneously.
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for contextual bandit, most of them rely on finding the maximum likelihood estimator at each iteration, which requires $O(t)$ time at the $t$-th iteration and are memory inefficient. A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant with respect to $t$, but a contextual bandit policy based on online SGD updates that balances exploration and exploitation has remained elusive. In this work, we show that online SGD can be applied to the generalized linear bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information and uses Thompson Sampling for exploration, achieves $tilde{O}(sqrt{T})$ regret with the total time complexity that scales linearly in $T$ and $d$, where $T$ is the total number of rounds and $d$ is the number of features. Experimental results show that SGD-TS consistently outperforms existing algorithms on both synthetic and real datasets.