Deep learning models are considered state-of-the-art in many offline machine learning tasks. However, many of the techniques developed are not suitable for online learning tasks. The problem of using deep learning models with sequential data becomes even harder when several loss functions need to be considered simultaneously, as in many real-world applications. In this paper, we therefore propose a novel online deep learning training procedure that can be used regardless of the neural network's architecture and is designed to handle the multiple-objective case. We demonstrate the effectiveness of our algorithm on the Neyman-Pearson classification problem on several benchmark datasets.
Online learning to rank is a core problem in information retrieval and machine learning. Many provably efficient algorithms have recently been proposed for this problem in specific click models. A click model is a model of how the user interacts with a list of documents. Though these results are significant, their impact on practice is limited, because all proposed algorithms are designed for specific click models and lack convergence guarantees in other models. In this work, we propose BatchRank, the first online learning to rank algorithm for a broad class of click models. The class encompasses the two most fundamental click models, the cascade and position-based models. We derive a gap-dependent upper bound on the $T$-step regret of BatchRank and evaluate it on a range of web search queries. We observe that BatchRank outperforms ranked bandits and is more robust than CascadeKL-UCB, an existing algorithm for the cascade model.
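To make the two click models named in this abstract concrete, the following is a minimal Python sketch of how clicks could be simulated under each. It is not BatchRank and is not taken from the paper; the function names, the `attraction` and `examination` probability tables, and the document labels are illustrative assumptions.

```python
import random

def cascade_clicks(attraction, ranked_list, rng):
    """Cascade model: the user scans top-down and stops at the first click."""
    clicks = [0] * len(ranked_list)
    for pos, doc in enumerate(ranked_list):
        # Each scanned document is clicked with its (assumed) attraction probability.
        if rng.random() < attraction[doc]:
            clicks[pos] = 1
            break  # the user leaves after the first click
    return clicks

def position_based_clicks(attraction, examination, ranked_list, rng):
    """Position-based model: a click needs the position to be examined
    and the document at that position to attract the user."""
    return [
        int(rng.random() < examination[pos] and rng.random() < attraction[doc])
        for pos, doc in enumerate(ranked_list)
    ]

if __name__ == "__main__":
    rng = random.Random(0)
    attraction = {"a": 0.2, "b": 0.6, "c": 0.4}   # hypothetical values
    examination = [1.0, 0.5, 0.25]                # hypothetical values
    print(cascade_clicks(attraction, ["a", "b", "c"], rng))
    print(position_based_clicks(attraction, examination, ["a", "b", "c"], rng))
```

Under the cascade model at most one position is clicked per list, whereas the position-based model can produce several clicks; an algorithm meant to cover both has to tolerate that difference in feedback.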
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies, whose expected cumulative reward over the course of $T$ rounds is maximum, and each has an expected cost below a certain threshold $\tau$. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB), and prove an $\widetilde{\mathcal{O}}(\frac{d\sqrt{T}}{\tau-c_0})$ bound on its $T$-round regret, where the denominator is the difference between the constraint threshold and the cost of a known feasible action. We further specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting. We prove a regret bound of $\widetilde{\mathcal{O}}(\frac{\sqrt{KT}}{\tau - c_0})$ for this algorithm in $K$-armed bandits, which is a $\sqrt{K}$ improvement over the regret bound we obtain by simply casting multi-armed bandits as an instance of contextual linear bandits and using the regret bound of OPLB. We also prove a lower-bound for the problem studied in the paper and provide simulations to validate our theoretical results.
We aim to jointly optimize the antenna tilt angle and the vertical and horizontal half-power beamwidths of the macrocells in a heterogeneous cellular network (HetNet). The interactions between the cells, most notably due to their coupled interference, render this optimization prohibitively complex. A single-agent reinforcement learning (RL) algorithm is scalable but yields quite suboptimal solutions for this optimization, whereas multi-agent RL algorithms yield better solutions at the expense of scalability. Hence, we propose a compromise between the two. Specifically, a multi-agent mean-field RL algorithm is first used in an offline phase to transfer information as features to the single-agent RL algorithm of the second (online) phase, which employs a deep neural network to learn users' locations. This two-step approach is a practical solution for real deployments, which should automatically adapt to environmental changes in the network. Our results illustrate that the proposed algorithm approaches the performance of multi-agent RL, which requires millions of trials, with only hundreds of online trials, assuming relatively low environmental dynamics, and performs much better than single-agent RL. Furthermore, the proposed algorithm is compact and implementable, and empirically appears to provide a performance guarantee regardless of the amount of environmental dynamics.
Recent advances in deep reinforcement learning have achieved human-level performance on a variety of real-world applications. However, current algorithms still suffer from poor gradient estimation with excessive variance, resulting in unstable training and poor sample efficiency. In this paper, we propose an innovative optimization strategy that uses stochastic variance reduced gradient (SVRG) techniques. In extensive experiments on the Atari domain, our method outperforms the deep Q-learning baselines on 18 out of 20 games.
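The SVRG idea referred to in this abstract can be illustrated on a toy problem. The sketch below applies the standard SVRG estimator, g_i(w) - g_i(snapshot) + full_grad(snapshot), to a small least-squares problem in pure Python. It is not the paper's deep Q-learning method; the dataset `DATA`, the step size, the epoch counts, and all function names are assumptions chosen only for illustration.

```python
import random

def grad_i(w, x, y):
    """Gradient of the per-example loss 0.5 * (w.x - y)**2 with respect to w."""
    err = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [err * xj for xj in x]

def full_grad(w, data):
    """Average of the per-example gradients over the whole dataset."""
    total = [0.0] * len(w)
    for x, y in data:
        total = [t + g for t, g in zip(total, grad_i(w, x, y))]
    return [t / len(data) for t in total]

def svrg_estimate(w, snapshot, mu, x, y):
    """SVRG estimator g_i(w) - g_i(snapshot) + mu, where mu = full_grad(snapshot).
    Averaged over all examples it equals full_grad(w), so it is unbiased."""
    g_w = grad_i(w, x, y)
    g_s = grad_i(snapshot, x, y)
    return [a - b + m for a, b, m in zip(g_w, g_s, mu)]

def svrg(data, w0, lr=0.01, epochs=60, inner_steps=1000, seed=0):
    """Outer loop refreshes the snapshot and its full gradient;
    inner loop takes cheap variance-reduced stochastic steps."""
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(epochs):
        snapshot = list(w)
        mu = full_grad(snapshot, data)
        for _ in range(inner_steps):
            x, y = data[rng.randrange(len(data))]
            v = svrg_estimate(w, snapshot, mu, x, y)
            w = [wj - lr * vj for wj, vj in zip(w, v)]
    return w

# Hypothetical noise-free data generated by y = 2*x1 - 1*x2.
DATA = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]

if __name__ == "__main__":
    print(svrg(DATA, [0.0, 0.0]))
```

The correction term shrinks as the iterate and the snapshot both approach the optimum, which is the source of the reduced variance compared with a plain stochastic gradient.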
Transfer learning, which transfers knowledge from related but different source domains to a target domain, has been demonstrated to be successful and essential in diverse applications. Online transfer learning (OTL) is a more challenging problem where the target data arrive in an online manner. Most OTL methods combine the source classifier and the target classifier directly by assigning a weight to each classifier and adjusting the weights constantly. However, these methods pay little attention to reducing the distribution discrepancy between domains. In this paper, we propose a novel online transfer learning method which seeks to find a new feature representation, so that the marginal and conditional distribution discrepancies can be reduced online simultaneously. We focus on online transfer learning with multiple source domains and use the Hedge strategy to leverage knowledge from source domains. We analyze the theoretical properties of the proposed algorithm and provide an upper bound on the number of mistakes it makes. Comprehensive experiments on two real-world datasets show that our method outperforms state-of-the-art methods by a large margin.
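The Hedge strategy mentioned in this abstract is a multiplicative-weights rule for combining several predictors. A minimal sketch of the standard update follows, assuming one classifier per source domain that outputs +1 or -1. The learning rate `eta`, the loss values, and the function names are illustrative assumptions; the paper's own loss and combination rule are not specified in the abstract.

```python
import math

def hedge_update(weights, losses, eta):
    """Scale each weight by exp(-eta * loss), then renormalize to sum to 1."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

def weighted_vote(weights, predictions):
    """Combine +1/-1 predictions by the sign of their weighted sum."""
    score = sum(w * p for w, p in zip(weights, predictions))
    return 1 if score >= 0 else -1

if __name__ == "__main__":
    weights = [1 / 3, 1 / 3, 1 / 3]      # one weight per hypothetical source classifier
    predictions = [1, -1, 1]
    true_label = 1
    print(weighted_vote(weights, predictions))
    # 0/1 loss per classifier on this round (an assumed choice of loss).
    losses = [0.0 if p == true_label else 1.0 for p in predictions]
    print(hedge_update(weights, losses, eta=0.5))
```

Classifiers that keep making mistakes lose weight exponentially fast, so the combined prediction comes to rely on the source domains that are most useful for the target stream.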