We introduce a new and rigorously-formulated PAC-Bayes few-shot meta-learning algorithm that implicitly learns a prior distribution of the model of interest. Our proposed method extends the PAC-Bayes framework from a single task setting to the few-shot learning setting to upper-bound generalisation errors on unseen tasks and samples. We also propose a generative-based approach to model the shared prior and the posterior of task-specific model parameters more expressively compared to the usual diagonal Gaussian assumption. We show that the models trained with our proposed meta-learning algorithm are well calibrated and accurate, with state-of-the-art calibration and classification results on few-shot classification (mini-ImageNet and tiered-ImageNet) and regression (multi-modal task-distribution regression) benchmarks.
By leveraging experience from previous tasks, meta-learning algorithms can achieve effective fast adaptation ability when encountering new tasks. However it is unclear how the generalization property applies to new tasks. Probably approximately correct (PAC) Bayes bound theory provides a theoretical framework to analyze the generalization performance for meta-learning. We derive three novel generalisation error bounds for meta-learning based on PAC-Bayes relative entropy bound. Furthermore, using the empirical risk minimization (ERM) method, a PAC-Bayes bound for meta-learning with data-dependent prior is developed. Experiments illustrate that the proposed three PAC-Bayes bounds for meta-learning guarantee a competitive generalization performance guarantee, and the extended PAC-Bayes bound with data-dependent prior can achieve rapid convergence ability.
Adding domain knowledge to a learning system is known to improve results. In multi-parameter Bayesian frameworks, such knowledge is incorporated as a prior. On the other hand, various model parameters can have different learning rates in real-world problems, especially with skewed data. Two often-faced challenges in Operation Management and Management Science applications are the absence of informative priors, and the inability to control parameter learning rates. In this study, we propose a hierarchical Empirical Bayes approach that addresses both challenges, and that can generalize to any Bayesian framework. Our method learns empirical meta-priors from the data itself and uses them to decouple the learning rates of first-order and second-order features (or any other given feature grouping) in a Generalized Linear Model. As the first-order features are likely to have a more pronounced effect on the outcome, focusing on learning first-order weights first is likely to improve performance and convergence time. Our Empirical Bayes method clamps features in each group together and uses the deployed models observed data to empirically compute a hierarchical prior in hindsight. We report theoretical results for the unbiasedness, strong consistency, and optimal frequentist cumulative regret properties of our meta-prior variance estimator. We apply our method to a standard supervised learning optimization problem, as well as an online combinatorial optimization problem in a contextual bandit setting implemented in an Amazon production system. Both during simulations and live experiments, our method shows marked improvements, especially in cases of small traffic. Our findings are promising, as optimizing over sparse data is often a challenge.
A significant challenge for the practical application of reinforcement learning in the real world is the need to specify an oracle reward function that correctly defines a task. Inverse reinforcement learning (IRL) seeks to avoid this challenge by instead inferring a reward function from expert behavior. While appealing, it can be impractically expensive to collect datasets of demonstrations that cover the variation common in the real world (e.g. opening any type of door). Thus in practice, IRL must commonly be performed with only a limited set of demonstrations where it can be exceedingly difficult to unambiguously recover a reward function. In this work, we exploit the insight that demonstrations from other tasks can be used to constrain the set of possible reward functions by learning a prior that is specifically optimized for the ability to infer expressive reward functions from limited numbers of demonstrations. We demonstrate that our method can efficiently recover rewards from images for novel tasks and provide intuition as to how our approach is analogous to learning a prior.
Many practical machine learning tasks can be framed as Structured prediction problems, where several output variables are predicted and considered interdependent. Recent theoretical advances in structured prediction have focused on obtaining fast rates convergence guarantees, especially in the Implicit Loss Embedding (ILE) framework. PAC-Bayes has gained interest recently for its capacity of producing tight risk bounds for predictor distributions. This work proposes a novel PAC-Bayes perspective on the ILE Structured prediction framework. We present two generalization bounds, on the risk and excess risk, which yield insights into the behavior of ILE predictors. Two learning algorithms are derived from these bounds. The algorithms are implemented and their behavior analyzed, with source code available at url{https://github.com/theophilec/PAC-Bayes-ILE-Structured-Prediction}.
Despite recent advances in its theoretical understanding, there still remains a significant gap in the ability of existing PAC-Bayesian theories on meta-learning to explain performance improvements in the few-shot learning setting, where the number of training examples in the target tasks is severely limited. This gap originates from an assumption in the existing theories which supposes that the number of training examples in the observed tasks and the number of training examples in the target tasks follow the same distribution, an assumption that rarely holds in practice. By relaxing this assumption, we develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theories. Furthermore, we derive a new computationally-efficient PACMAML algorithm, and show it outperforms existing meta-learning algorithms on several few-shot benchmark datasets.