No Arabic abstract
Given multiple source domains, domain generalization aims at learning a universal model that performs well on any unseen but related target domain. In this work, we focus on the domain generalization scenario where domain shifts occur among class-conditional distributions of different domains. Existing approaches are not sufficiently robust when the variation of conditional distributions given the same class is large. In this work, we extend the concept of distributional robust optimization to solve the class-conditional domain generalization problem. Our approach optimizes the worst-case performance of a classifier over class-conditional distributions within a Wasserstein ball centered around the barycenter of the source conditional distributions. We also propose an iterative algorithm for learning the optimal radius of the Wasserstein balls automatically. Experiments show that the proposed framework has better performance on unseen target domain than approaches without domain generalization.
While deep neural networks demonstrate state-of-the-art performance on a variety of learning tasks, their performance relies on the assumption that train and test distributions are the same, which may not hold in real-world applications. Domain generalization addresses this issue by employing multiple source domains to build robust models that can generalize to unseen target domains subject to shifts in data distribution. In this paper, we propose Domain-Free Domain Generalization (DFDG), a model-agnostic method to achieve better generalization performance on the unseen test domain without the need for source domain labels. DFDG uses novel strategies to learn domain-invariant class-discriminative features. It aligns class relationships of samples through class-conditional soft labels, and uses saliency maps, traditionally developed for post-hoc analysis of image classification networks, to remove superficial observations from training inputs. DFDG obtains competitive performance on both time series sensor and image classification public datasets.
Many machine learning tasks involve subpopulation shift where the testing data distribution is a subpopulation of the training distribution. For such settings, a line of recent work has proposed the use of a variant of empirical risk minimization(ERM) known as distributionally robust optimization (DRO). In this work, we apply DRO to real, large-scale tasks with subpopulation shift, and observe that DRO performs relatively poorly, and moreover has severe instability. We identify one direct cause of this phenomenon: sensitivity of DRO to outliers in the datasets. To resolve this issue, we propose the framework of DORO, for Distributional and Outlier Robust Optimization. At the core of this approach is a refined risk function which prevents DRO from overfitting to potential outliers. We instantiate DORO for the Cressie-Read family of Renyi divergence, and delve into two specific instances of this family: CVaR and $chi^2$-DRO. We theoretically prove the effectiveness of the proposed method, and empirically show that DORO improves the performance and stability of DRO with experiments on large modern datasets, thereby positively addressing the open question raised by Hashimoto et al., 2018.
Projection robust Wasserstein (PRW) distance, or Wasserstein projection pursuit (WPP), is a robust variant of the Wasserstein distance. Recent work suggests that this quantity is more robust than the standard Wasserstein distance, in particular when comparing probability measures in high-dimensions. However, it is ruled out for practical application because the optimization model is essentially non-convex and non-smooth which makes the computation intractable. Our contribution in this paper is to revisit the original motivation behind WPP/PRW, but take the hard route of showing that, despite its non-convexity and lack of nonsmoothness, and even despite some hardness results proved by~citet{Niles-2019-Estimation} in a minimax sense, the original formulation for PRW/WPP textit{can} be efficiently computed in practice using Riemannian optimization, yielding in relevant cases better behavior than its convex relaxation. More specifically, we provide three simple algorithms with solid theoretical guarantee on their complexity bound (one in the appendix), and demonstrate their effectiveness and efficiency by conducing extensive experiments on synthetic and real data. This paper provides a first step into a computational theory of the PRW distance and provides the links between optimal transport and Riemannian optimization.
We address the problem of unsupervised domain adaptation (UDA) by learning a cross-domain agnostic embedding space, where the distance between the probability distributions of the two source and target visual domains is minimized. We use the output space of a shared cross-domain deep encoder to model the embedding space anduse the Sliced-Wasserstein Distance (SWD) to measure and minimize the distance between the embedded distributions of two source and target domains to enforce the embedding to be domain-agnostic.Additionally, we use the source domain labeled data to train a deep classifier from the embedding space to the label space to enforce the embedding space to be discriminative.As a result of this training scheme, we provide an effective solution to train the deep classification network on the source domain such that it will generalize well on the target domain, where only unlabeled training data is accessible. To mitigate the challenge of class matching, we also align corresponding classes in the embedding space by using high confidence pseudo-labels for the target domain, i.e. assigning the class for which the source classifier has a high prediction probability. We provide experimental results on UDA benchmark tasks to demonstrate that our method is effective and leads to state-of-the-art performance.
Investigation of machine learning algorithms robust to changes between the training and test distributions is an active area of research. In this paper we explore a special type of dataset shift which we call class-dependent domain shift. It is characterized by the following features: the input data causally depends on the label, the shift in the data is fully explained by a known variable, the variable which controls the shift can depend on the label, there is no shift in the label distribution. We define a simple optimization problem with an information theoretic constraint and attempt to solve it with neural networks. Experiments on a toy dataset demonstrate the proposed method is able to learn robust classifiers which generalize well to unseen domains.