No Arabic abstract
We consider stochastic convex optimization problems, where several machines act asynchronously in parallel while sharing a common memory. We propose a robust training method for the constrained setting and derive non asymptotic convergence guarantees that do not depend on prior knowledge of update delays, objective smoothness, and gradient variance. Conversely, existing methods for this setting crucially rely on this prior knowledge, which render them unsuitable for essentially all shared-resources computational environments, such as clouds and data centers. Concretely, existing approaches are unable to accommodate changes in the delays which result from dynamic allocation of the machines, while our method implicitly adapts to such changes.
The increasing connectivity of data and cyber-physical systems has resulted in a growing number of cyber-attacks. Real-time detection of such attacks, through the identification of anomalous activity, is required so that mitigation and contingent actions can be effectively and rapidly deployed. We propose a new approach for aggregating unsupervised anomaly detection algorithms and incorporating feedback when it becomes available. We apply this approach to open-source real datasets and show that both aggregating models, which we call experts, and incorporating feedback significantly improve the performance. An important property of the proposed approaches is their theoretical guarantees that they perform close to the best superexpert, which can switch between the best performing experts, in terms of the cumulative average losses.
This paper addresses the problem of multiclass classification with corrupted or noisy bandit feedback. In this setting, the learner may not receive true feedback. Instead, it receives feedback that has been flipped with some non-zero probability. We propose a novel approach to deal with noisy bandit feedback based on the unbiased estimator technique. We further offer a method that can efficiently estimate the noise rates, thus providing an end-to-end framework. The proposed algorithm enjoys a mistake bound of the order of $O(sqrt{T})$ in the high noise case and of the order of $O(T^{ icefrac{2}{3}})$ in the worst case. We show our approachs effectiveness using extensive experiments on several benchmark datasets.
In situations where explicit communication is limited, human collaborators act by learning to: (i) infer meaning behind their partners actions, and (ii) convey private information about the state to their partner implicitly through actions. The first component of this learning process has been well-studied in multi-agent systems, whereas the second --- which is equally crucial for successful collaboration --- has not. To mimic both components mentioned above, thereby completing the learning process, we introduce a novel algorithm: Policy Belief Learning (PBL). PBL uses a belief module to model the other agents private information and a policy module to form a distribution over actions informed by the belief module. Furthermore, to encourage communication by actions, we propose a novel auxiliary reward which incentivizes one agent to help its partner to make correct inferences about its private information. The auxiliary reward for communication is integrated into the learning of the policy module. We evaluate our approach on a set of environments including a matrix game, particle environment and the non-competitive bidding problem from contract bridge. We show empirically that this auxiliary reward is effective and easy to generalize. These results demonstrate that our PBL algorithm can produce strong pairs of agents in collaborative games where explicit communication is disabled.
We explore a novel setting of the Multi-Armed Bandit (MAB) problem inspired from real world applications which we call bandits with stochastic delayed composite anonymous feedback (SDCAF). In SDCAF, the rewards on pulling arms are stochastic with respect to time but spread over a fixed number of time steps in the future after pulling the arm. The complexity of this problem stems from the anonymous feedback to the player and the stochastic generation of the reward. Due to the aggregated nature of the rewards, the player is unable to associate the reward to a particular time step from the past. We present two algorithms for this more complicated setting of SDCAF using phase based extensions of the UCB algorithm. We perform regret analysis to show sub-linear theoretical guarantees on both the algorithms.
In the NIPS 2017 Learning to Run challenge, participants were tasked with building a controller for a musculoskeletal model to make it run as fast as possible through an obstacle course. Top participants were invited to describe their algorithms. In this work, we present eight solutions that used deep reinforcement learning approaches, based on algorithms such as Deep Deterministic Policy Gradient, Proximal Policy Optimization, and Trust Region Policy Optimization. Many solutions use similar relaxations and heuristics, such as reward shaping, frame skipping, discretization of the action space, symmetry, and policy blending. However, each of the eight teams implemented different modifications of the known algorithms.