ﻻ يوجد ملخص باللغة العربية
We consider a novel data driven approach for designing learning algorithms that can effectively learn with only a small number of labeled examples. This is crucial for modern machine learning applications where labels are scarce or expensive to obtain. We focus on graph-based techniques, where the unlabeled examples are connected in a graph under the implicit assumption that similar nodes likely have similar labels. Over the past decades, several elegant graph-based semi-supervised and active learning algorithms for how to infer the labels of the unlabeled examples given the graph and a few labeled examples have been proposed. However, the problem of how to create the graph (which impacts the practical usefulness of these methods significantly) has been relegated to domain-specific art and heuristics and no general principles have been proposed. In this work we present a novel data driven approach for learning the graph and provide strong formal guarantees in both the distributional and online learning formalizations. We show how to leverage problem instances coming from an underlying problem domain to learn the graph hyperparameters from commonly used parametric families of graphs that perform well on new instances coming from the same domain. We obtain low regret and efficient algorithms in the online setting, and generalization guarantees in the distributional setting. We also show how to combine several very different similarity metrics and learn multiple hyperparameters, providing general techniques to apply to large classes of problems. We expect some of the tools and techniques we develop along the way to be of interest beyond semi-supervised and active learning, for data driven algorithms for combinatorial problems more generally.
Representation learning has been proven to play an important role in the unprecedented success of machine learning models in numerous tasks, such as machine translation, face recognition and recommendation. The majority of existing representation lea
Despite the success of deep neural networks (DNNs) in image classification tasks, the human-level performance relies on massive training data with high-quality manual annotations, which are expensive and time-consuming to collect. There exist many in
Distantly-labeled data can be used to scale up training of statistical models, but it is typically noisy and that noise can vary with the distant labeling technique. In this work, we propose a two-stage procedure for handling this type of data: denoi
Noisy labeled data is more a norm than a rarity for self-generated content that is continuously published on the web and social media. Due to privacy concerns and governmental regulations, such a data stream can only be stored and used for learning p
In the field of machine learning there is a growing interest towards more robust and generalizable algorithms. This is for example important to bridge the gap between the environment in which the training data was collected and the environment where