We study Online Convex Optimization in the unbounded setting where neither predictions nor gradient are constrained. The goal is to simultaneously adapt to both the sequence of gradients and the comparator. We first develop parameter-free and scale-free algorithms for a simplified setting with hints. We present t
We aim to design adaptive online learning algorithms that take advantage of any special structure that might be present in the learning task at hand, with as little manual tuning by the user as possible. A fundamental obstacle that comes up in the de
sign of such adaptive algorithms is to calibrate a so-called step-size or learning rate hyperparameter depending on variance, gradient norms, etc. A recent technique promises to overcome this difficulty by maintaining multiple learning rates in parallel. This technique has been applied in the MetaGrad algorithm for online convex optimization and the Squint algorithm for prediction with expert advice. However, in both cases the user still has to provide in advance a Lipschitz hyperparameter that bounds the norm of the gradients. Although this hyperparameter is typically not available in advance, tuning it correctly is crucial: if it is set too small, the methods may fail completely; but if it is taken too large, performance deteriorates significantly. In the present work we remove this Lipschitz hyperparameter by designing n
Deep neural networks are considered to be state of the art models in many offline machine learning tasks. However, their performance and generalization abilities in online learning tasks are much less understood. Therefore, we focus on online learnin
g and tackle the challenging problem where the underlying process is stationary and ergodic and thus removing the i.i.d. assumption and allowing observations to depend on each other arbitrarily. We prove the generalization abilities of Lipschitz regularized deep neural networks and show that by using those networks, a convergence to the best possible prediction strategy is guaranteed.
We propose a novel framework, called Markov-Lipschitz deep learning (MLDL), to tackle geometric deterioration caused by collapse, twisting, or crossing in vector-based neural network transformations for manifold-based representation learning and mani
fold data generation. A prior constraint, called locally isometric smoothness (LIS), is imposed across-layers and encoded into a Markov random field (MRF)-Gibbs distribution. This leads to the best possible solutions for local geometry preservation and robustness as measured by locally geometric distortion and locally bi-Lipschitz continuity. Consequently, the layer-wise vector transformations are enhanced into well-behaved, LIS-constrained metric homeomorphisms. Extensive experiments, comparisons, and ablation study demonstrate significant advantages of MLDL for manifold learning and manifold data generation. MLDL is general enough to enhance any vector transformation-based networks. The code is available at https://github.com/westlake-cairi/Markov-Lipschitz-Deep-Learning.
Optimization in the presence of sharp (non-Lipschitz), unpredictable (w.r.t. time and amount) changes is a challenging and largely unexplored problem of great significance. We consider the class of piecewise Lipschitz functions, which is the most gen
eral online setting considered in the literature for the problem, and arises naturally in various combinatorial algorithm selection problems where utility functions can have sharp discontinuities. The usual performance metric of $mathit{static}$ regret minimizes the gap between the payoff accumulated and that of the best fixed point for the entire duration, and thus fails to capture changing environments. Shifting regret is a useful alternative, which allows for up to $s$ environment shifts. In this work we provide an $O(sqrt{sdTlog T}+sT^{1-beta})$ regret bound for $beta$-dispersed functions, where $beta$ roughly quantifies the rate at which discontinuities appear in the utility functions in expectation (typically $betage1/2$ in problems of practical interest). We also present a lower bound tight up to sub-logarithmic factors. We further obtain improved bounds when selecting from a small pool of experts. We empirically demonstrate a key application of our algorithms to online clustering problems on popular benchmarks.
The federated learning (FL) framework trains a machine learning model using decentralized data stored at edge client devices by periodically aggregating locally trained models. Popular optimization algorithms of FL use vanilla (stochastic) gradient d
escent for both local updates at clients and global updates at the aggregating server. Recently, adaptive optimization methods such as AdaGrad have been studied for server updates. However, the effect of using adaptive optimization methods for local updates at clients is not yet understood. We show in both theory and practice that while local adaptive methods can accelerate convergence, they can cause a non-vanishing solution bias, where the final converged solution may be different from the stationary point of the global objective function. We propose correction techniques to overcome this inconsistency and complement the local adaptive methods for FL. Extensive experiments on realistic federated training tasks show that the proposed algorithms can achieve faster convergence and higher test accuracy than the baselines without local adaptivity.