No Arabic abstract
Motivated by a natural problem in online model selection with bandit information, we introduce and analyze a best arm identification problem in the rested bandit setting, wherein arm expected losses decrease with the number of times the arm has been played. The shape of the expected loss functions is similar across arms, and is assumed to be available up to unknown parameters that have to be learned on the fly. We define a novel notion of regret for this problem, where we compare to the policy that always plays the arm having the smallest expected loss at the end of the game. We analyze an arm elimination algorithm whose regret vanishes as the time horizon increases. The actual rate of convergence depends in a detailed way on the postulated functional form of the expected losses. Unlike known model selection efforts in the recent bandit literature, our algorithm exploits the specific structure of the problem to learn the unknown parameters of the expected loss function so as to identify the best arm as quickly as possible. We complement our analysis with a lower bound, indicating strengths and limitations of the proposed solution.
Online feature selection has been an active research area in recent years. We propose a novel diverse online feature selection method based on Determinantal Point Processes (DPP). Our model aims to provide diverse features which can be composed in either a supervised or unsupervised framework. The framework aims to promote diversity based on the kernel produced on a feature level, through at most three stages: feature sampling, local criteria and global criteria for feature selection. In the feature sampling, we sample incoming stream of features using conditional DPP. The local criteria is used to assess and select streamed features (i.e. only when they arrive), we use unsupervised scale invariant methods to remove redundant features and optionally supervised methods to introduce label information to assess relevant features. Lastly, the global criteria uses regularization methods to select a global optimal subset of features. This three stage procedure continues until there are no more features arriving or some predefined stopping condition is met. We demonstrate based on experiments conducted on that this approach yields better compactness, is comparable and in some instances outperforms other state-of-the-art online feature selection methods.
Financial markets are complex environments that produce enormous amounts of noisy and non-stationary data. One fundamental problem is online portfolio selection, the goal of which is to exploit this data to sequentially select portfolios of assets to achieve positive investment outcomes while managing risks. Various algorithms have been proposed for solving this problem in fields such as finance, statistics and machine learning, among others. Most of the methods have parameters that are estimated from backtests for good performance. Since these algorithms operate on non-stationary data that reflects the complexity of financial markets, we posit that adaptively tuning these parameters in an intelligent manner is a remedy for dealing with this complexity. In this paper, we model the mapping between the parameter space and the space of performance metrics using a Gaussian process prior. We then propose an oracle based on adaptive Bayesian optimization for automatically and adaptively configuring online portfolio selection methods. We test the efficacy of our solution on algorithms operating on equity and index data from various markets.
A challenge that machine learning practitioners in the industry face is the task of selecting the best model to deploy in production. As a model is often an intermediate component of a production system, online controlled experiments such as A/B tests yield the most reliable estimation of the effectiveness of the whole system, but can only compare two or a few models due to budget constraints. We propose an automated online experimentation mechanism that can efficiently perform model selection from a large pool of models with a small number of online experiments. We derive the probability distribution of the metric of interest that contains the model uncertainty from our Bayesian surrogate model trained using historical logs. Our method efficiently identifies the best model by sequentially selecting and deploying a list of models from the candidate set that balance exploration-exploitation. Using simulations based on real data, we demonstrate the effectiveness of our method on two different tasks.
We consider the stochastic contextual bandit problem under the high dimensional linear model. We focus on the case where the action space is finite and random, with each action associated with a randomly generated contextual covariate. This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine. However, it is very challenging as we need to balance exploration and exploitation. We propose doubly growing epochs and estimating the parameter using the best subset selection method, which is easy to implement in practice. This approach achieves $ tilde{mathcal{O}}(ssqrt{T})$ regret with high probability, which is nearly independent in the ``ambient regression model dimension $d$. We further attain a sharper $tilde{mathcal{O}}(sqrt{sT})$ regret by using the textsc{SupLinUCB} framework and match the minimax lower bound of low-dimensional linear stochastic bandit problems. Finally, we conduct extensive numerical experiments to demonstrate the applicability and robustness of our algorithms empirically.
We study a bandit version of phase retrieval where the learner chooses actions $(A_t)_{t=1}^n$ in the $d$-dimensional unit ball and the expected reward is $langle A_t, theta_starrangle^2$ where $theta_star in mathbb R^d$ is an unknown parameter vector. We prove that the minimax cumulative regret in this problem is $smash{tilde Theta(d sqrt{n})}$, which improves on the best known bounds by a factor of $smash{sqrt{d}}$. We also show that the minimax simple regret is $smash{tilde Theta(d / sqrt{n})}$ and that this is only achievable by an adaptive algorithm. Our analysis shows that an apparently convincing heuristic for guessing lower bounds can be misleading and that uniform bounds on the information ratio for information-directed sampling are not sufficient for optimal regret.