No Arabic abstract
We show that a single softmax neural net with minimal changes can beat the uncertainty predictions of Deep Ensembles and other more complex single-forward-pass uncertainty approaches. Standard softmax neural nets suffer from feature collapse and extrapolate arbitrarily for OoD points. This results in arbitrary softmax entropies for OoD points which can have high entropy, low, or anything in between, thus cannot capture epistemic uncertainty reliably. We prove that this failure lies at the core of why Deep Ensemble Uncertainty works well. Instead of using softmax entropy, we show that with appropriate inductive biases softmax neural nets trained with maximum likelihood reliably capture epistemic uncertainty through their feature-space density. This density is obtained using simple Gaussian Discriminant Analysis, but it cannot represent aleatoric uncertainty reliably. We show that it is necessary to combine feature-space density with softmax entropy to disentangle uncertainties well. We evaluate the epistemic uncertainty quality on active learning and OoD detection, achieving SOTA ~98 AUROC on CIFAR-10 vs SVHN without fine-tuning on OoD data.
We introduce the textit{epistemic neural network} (ENN) as an interface for uncertainty modeling in deep learning. All existing approaches to uncertainty modeling can be expressed as ENNs, and any ENN can be identified with a Bayesian neural network. However, this new perspective provides several promising directions for future research. Where prior work has developed probabilistic inference tools for neural networks; we ask instead, `which neural networks are suitable as tools for probabilistic inference?. We propose a clear and simple metric for progress in ENNs: the KL-divergence with respect to a target distribution. We develop a computational testbed based on inference in a neural network Gaussian process and release our code as a benchmark at url{https://github.com/deepmind/enn}. We evaluate several canonical approaches to uncertainty modeling in deep learning, and find they vary greatly in their performance. We provide insight to the sensitivity of these results and show that our metric is highly correlated with performance in sequential decision problems. Finally, we provide indications that new ENN architectures can improve performance in both the statistical quality and computational cost.
Artificial intelligence (AI) has undergone a renaissance recently, making major progress in key domains such as vision, language, control, and decision-making. This has been due, in part, to cheap data and cheap compute resources, which have fit the natural strengths of deep learning. However, many defining characteristics of human intelligence, which developed under much different pressures, remain out of reach for current approaches. In particular, generalizing beyond ones experiences--a hallmark of human intelligence from infancy--remains a formidable challenge for modern AI. The following is part position paper, part review, and part unification. We argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representations and computations are key to realizing this objective. Just as biology uses nature and nurture cooperatively, we reject the false choice between hand-engineering and end-to-end learning, and instead advocate for an approach which benefits from their complementary strengths. We explore how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them. We present a new building block for the AI toolkit with a strong relational inductive bias--the graph network--which generalizes and extends various approaches for neural networks that operate on graphs, and provides a straightforward interface for manipulating structured knowledge and producing structured behaviors. We discuss how graph networks can support relational reasoning and combinatorial generalization, laying the foundation for more sophisticated, interpretable, and flexible patterns of reasoning. As a companion to this paper, we have released an open-source software library for building graph networks, with demonstrations of how to use them in practice.
Deep neural networks with batch normalization (BN-DNNs) are invariant to weight rescaling due to their normalization operations. However, using weight decay (WD) benefits these weight-scale-invariant networks, which is often attributed to an increase of the effective learning rate when the weight norms are decreased. In this paper, we demonstrate the insufficiency of the previous explanation and investigate the implicit biases of stochastic gradient descent (SGD) on BN-DNNs to provide a theoretical explanation for the efficacy of weight decay. We identity two implicit biases of SGD on BN-DNNs: 1) the weight norms in SGD training remain constant in the continuous-time domain and keep increasing in the discrete-time domain; 2) SGD optimizes weight vectors in fully-connected networks or convolution kernels in convolution neural networks by updating components lying in the input feature span, while leaving those components orthogonal to the input feature span unchanged. Thus, SGD without WD accumulates weight noise orthogonal to the input feature span, and cannot eliminate such noise. Our empirical studies corroborate the hypothesis that weight decay suppresses weight noise that is left untouched by SGD. Furthermore, we propose to use weight rescaling (WRS) instead of weight decay to achieve the same regularization effect, while avoiding performance degradation of WD on some momentum-based optimizers. Our empirical results on image recognition show that regardless of optimization methods and network architectures, training BN-DNNs using WRS achieves similar or better performance compared with using WD. We also show that training with WRS generalizes better compared to WD, on other computer vision tasks.
A set of novel approaches for estimating epistemic uncertainty in deep neural networks with a single forward pass has recently emerged as a valid alternative to Bayesian Neural Networks. On the premise of informative representations, these deterministic uncertainty methods (DUMs) achieve strong performance on detecting out-of-distribution (OOD) data while adding negligible computational costs at inference time. However, it remains unclear whether DUMs are well calibrated and can seamlessly scale to real-world applications - both prerequisites for their practical deployment. To this end, we first provide a taxonomy of DUMs, evaluate their calibration under continuous distributional shifts and their performance on OOD detection for image classification tasks. Then, we extend the most promising approaches to semantic segmentation. We find that, while DUMs scale to realistic vision tasks and perform well on OOD detection, the practicality of current methods is undermined by poor calibration under realistic distributional shifts.
The distribution of a neural networks latent representations has been successfully used to detect out-of-distribution (OOD) data. This work investigates whether this distribution moreover correlates with a models epistemic uncertainty, thus indicates its ability to generalise to novel inputs. We first empirically verify that epistemic uncertainty can be identified with the surprise, thus the negative log-likelihood, of observing a particular latent representation. Moreover, we demonstrate that the output-conditional distribution of hidden representations also allows quantifying aleatoric uncertainty via the entropy of the predictive distribution. We analyse epistemic and aleatoric uncertainty inferred from the representations of different layers and conclude that deeper layers lead to uncertainty with similar behaviour as established - but computationally more expensive - methods (e.g. deep ensembles). While our approach does not require modifying the training process, we follow prior work and experiment with an additional regularising loss that increases the information in the latent representations. We find that this leads to improved OOD detection of epistemic uncertainty at the cost of ambiguous calibration close to the data distribution. We verify our findings on both classification and regression models.