In this paper, we propose generating artificial data that retain the statistical properties of real data as a means of providing privacy with respect to the original dataset. We use a generative adversarial network to draw privacy-preserving artificial data samples and derive an empirical method to assess the risk of information disclosure in a differential-privacy-like way. Our experiments show that we are able to generate high-quality artificial data and to successfully train and validate machine learning models on this data while limiting potential privacy loss.
Deep learning techniques have achieved remarkable performance in wide-ranging tasks. However, when models are trained on privacy-sensitive datasets, their parameters may expose private information in the training data. Prior attempts at differentially private training, although offering rigorous privacy guarantees, lead to much lower model performance than non-private training. In addition, different runs of the same training algorithm produce models with large performance variance. To address these issues, we propose DPlis (Differentially Private Learning wIth Smoothing). The core idea of DPlis is to construct a smooth loss function that favors noise-resilient models lying in large flat regions of the loss landscape. We provide theoretical justification for the utility improvements of DPlis. Extensive experiments also demonstrate that DPlis can effectively boost model quality and training stability under a given privacy budget.
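As a rough illustration of how a smoothed loss can favor flat regions, the sketch below assumes Gaussian smoothing, i.e. replacing the loss $L(w)$ with a Monte Carlo estimate of $\mathbb{E}_{z \sim \mathcal{N}(0, \sigma^2 I)}[L(w + z)]$. This specific form, the function name `smoothed_loss`, and the two toy quadratic losses are assumptions made for illustration; the abstract does not specify how DPlis constructs its smooth loss.

```python
import numpy as np

def smoothed_loss(loss_fn, w, sigma, num_samples=1000, seed=0):
    """Monte Carlo estimate of E[loss_fn(w + z)] with z ~ N(0, sigma^2 I).

    Assumed form of smoothing, used here only for illustration.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    noise = rng.normal(0.0, sigma, size=(num_samples,) + w.shape)
    return float(np.mean([loss_fn(w + z) for z in noise]))

# Two toy losses with the same minimum (0 at w = 0) but different curvature.
def sharp_loss(w):
    return 0.5 * 100.0 * float(np.sum(w ** 2))

def flat_loss(w):
    return 0.5 * 1.0 * float(np.sum(w ** 2))

w_star = np.zeros(2)
sharp_value = smoothed_loss(sharp_loss, w_star, sigma=0.1)
flat_value = smoothed_loss(flat_loss, w_star, sigma=0.1)

# The unsmoothed losses tie at w_star, but the smoothed loss is much
# larger at the sharp minimum, so minimizing it prefers the flat one.
print(sharp_value, flat_value)
```

Both toy losses are zero at `w_star`, so an ordinary optimizer has no reason to prefer one minimum over the other; under smoothing, the sharp minimum is penalized in proportion to its curvature.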
This paper studies the relationship between generalization and privacy preservation in iterative learning algorithms in two sequential steps. We first establish an alignment between generalization and privacy preservation for any learning algorithm. We prove that $(\varepsilon, \delta)$-differential privacy implies an on-average generalization bound for multi-database learning algorithms, which in turn leads to a high-probability bound for any learning algorithm. This high-probability bound also implies a PAC-learnability guarantee for differentially private learning algorithms. We then investigate how the iterative nature shared by most learning algorithms influences privacy preservation and, in turn, generalization. We propose three composition theorems that approximate the differential privacy of any iterative algorithm through the differential privacy of each of its iterations. By integrating these two steps, we deliver generalization bounds for iterative learning algorithms, which suggest that one can simultaneously enhance privacy preservation and generalization. Our bounds are strictly tighter than those in existing works. In particular, our generalization bounds do not rely on the model size, which is prohibitively large in deep learning. This sheds light on the generalizability of deep learning. These results apply to a wide spectrum of learning algorithms. In this paper, we apply them to stochastic gradient Langevin dynamics and agnostic federated learning as examples.
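For concreteness, recall that a mechanism $M$ is $(\varepsilon, \delta)$-differentially private if $\Pr[M(D) \in S] \le e^{\varepsilon} \Pr[M(D') \in S] + \delta$ for all neighboring datasets $D, D'$ and all measurable sets $S$. The sketch below does not implement the three composition theorems of this paper; it only illustrates the general idea of bounding the privacy of a $k$-step iterative algorithm from per-iteration budgets, using the standard basic and advanced composition bounds. The function names are hypothetical.

```python
import math

def basic_composition(budgets):
    """Basic composition: per-iteration (epsilon, delta) pairs simply add up."""
    eps_total = sum(eps for eps, _ in budgets)
    delta_total = sum(delta for _, delta in budgets)
    return eps_total, delta_total

def advanced_composition(eps, delta, k, delta_prime):
    """Standard advanced composition bound for k runs of an (eps, delta)-DP step.

    Returns (eps_total, k * delta + delta_prime), valid for any delta_prime > 0.
    """
    eps_total = (math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * eps
                 + k * eps * (math.exp(eps) - 1.0))
    return eps_total, k * delta + delta_prime

# Example: 100 iterations, each (0.1, 1e-6)-differentially private.
basic_eps, basic_delta = basic_composition([(0.1, 1e-6)] * 100)
adv_eps, adv_delta = advanced_composition(0.1, 1e-6, k=100, delta_prime=1e-6)
print(basic_eps, basic_delta)
print(adv_eps, adv_delta)
```

For small per-iteration budgets and many iterations, the advanced bound grows roughly with $\sqrt{k}$ rather than $k$, at the cost of a slightly larger $\delta$.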
While rich medical datasets are hosted in hospitals distributed across the world, concerns about patient privacy are a barrier to using such data to train deep neural networks (DNNs) for medical diagnostics. We propose Dopamine, a system to train DNNs on distributed datasets, which employs federated learning (FL) with differentially private stochastic gradient descent (DPSGD) and, in combination with secure aggregation, can establish a better trade-off between the differential privacy (DP) guarantee and DNN accuracy than other approaches. Results on a diabetic retinopathy (DR) task show that Dopamine provides a DP guarantee close to that of its centralized training counterpart, while achieving better classification accuracy than FL with parallel DP, where DPSGD is applied without coordination. Code is available at https://github.com/ipc-lab/private-ml-for-health.
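The DPSGD building block mentioned above can be sketched as a single update step: clip each per-example gradient to a fixed norm, add Gaussian noise to the sum, then average. This is a minimal NumPy sketch of the generic DPSGD recipe, not the Dopamine implementation from the linked repository; the function name `dp_sgd_step` and its parameters are hypothetical, and privacy accounting, federated averaging, and secure aggregation are left out.

```python
import numpy as np

def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One generic DPSGD update on a flat weight vector (illustrative only)."""
    grads = np.asarray(per_example_grads, dtype=float)  # shape (batch, dim)
    # Scale each per-example gradient so its L2 norm is at most clip_norm.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Add Gaussian noise calibrated to the clipping norm, then average.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / grads.shape[0]
    return weights - lr * noisy_mean

rng = np.random.default_rng(0)
weights = np.zeros(2)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
updated = dp_sgd_step(weights, grads, lr=1.0, clip_norm=1.0,
                      noise_multiplier=1.1, rng=rng)
print(updated)
```

Clipping bounds how much any single example can move the weights, which is what lets the added noise be calibrated to `clip_norm` rather than to the unbounded raw gradients.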
Differentially private stochastic gradient descent (DPSGD) is a variation of stochastic gradient descent based on the Differential Privacy (DP) paradigm, which can mitigate privacy threats arising from the presence of sensitive information in training data. One major drawback of training deep neural networks with DPSGD is a reduction in the model's accuracy. In this paper, we propose an alternative method for preserving data privacy based on introducing noise through learnable probability distributions, which leads to a significant improvement in the utility of the resulting private models. We also demonstrate that normalization layers have a large beneficial impact on the performance of deep neural networks with noisy parameters. In particular, we show that, contrary to general belief, a large amount of random noise can be added to the weights of neural networks without harming performance, once the networks are augmented with normalization layers. We hypothesize that this robustness is a consequence of the scale invariance property of normalization operators. Building on these observations, we propose a new algorithmic technique for training deep neural networks under very low privacy budgets by sampling weights from Gaussian distributions and using batch or layer normalization to prevent performance degradation. Our method outperforms previous approaches, including DPSGD, by a substantial margin in a comprehensive set of experiments on Computer Vision and Natural Language Processing tasks. In particular, we obtain a 20 percent accuracy improvement over DPSGD on the MNIST and CIFAR10 datasets with privacy budgets of $\varepsilon = 0.05$ and $\varepsilon = 2.0$, respectively. Our code is available online: https://github.com/uds-lsv/SIDP.
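The scale invariance property referred to above can be checked numerically. The sketch below is not taken from the linked repository: it samples a toy weight matrix from a Gaussian around hypothetical learned means (`w_mean`, `w_std` are made-up values) and shows that rescaling the weights by a positive constant leaves the layer-normalized output unchanged. It illustrates only the invariance itself, not the full training method or its privacy analysis.

```python
import numpy as np

def layer_norm(x):
    """Normalize the last axis to zero mean and unit standard deviation."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / std

rng = np.random.default_rng(0)
x = rng.standard_normal(8)              # toy input vector
w_mean = rng.standard_normal((4, 8))    # hypothetical learned means
w_std = 0.5                             # hypothetical noise scale

# Sample noisy weights from a Gaussian centered at the learned means.
w_noisy = w_mean + w_std * rng.standard_normal(w_mean.shape)

out = layer_norm(w_noisy @ x)
out_rescaled = layer_norm((10.0 * w_noisy) @ x)

# Multiplying the weights by a positive constant does not change the output.
same = bool(np.allclose(out, out_rescaled))
print(same)
```

Because the normalized output depends only on the direction of the pre-activation vector and not on its magnitude, changes in the overall scale of noisy weights are absorbed by the normalization layer.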
Some machine learning applications involve training data that is sensitive, such as the medical histories of patients in a clinical trial. A model may inadvertently and implicitly store some of its training data; careful analysis of the model may therefore reveal sensitive information. To address this problem, we demonstrate a generally applicable approach to providing strong privacy guarantees for training data: Private Aggregation of Teacher Ensembles (PATE). The approach combines, in a black-box fashion, multiple models trained on disjoint datasets, such as records from different subsets of users. Because they rely directly on sensitive data, these models are not published, but are instead used as teachers for a student model. The student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters. The student's privacy properties can be understood both intuitively (since no single teacher, and thus no single dataset, dictates the student's training) and formally, in terms of differential privacy. These properties hold even if an adversary can not only query the student but also inspect its internal workings. Compared with previous work, the approach imposes only weak assumptions on how teachers are trained: it applies to any model, including non-convex models like DNNs. We achieve state-of-the-art privacy/utility trade-offs on MNIST and SVHN thanks to an improved privacy analysis and semi-supervised learning.
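The noisy voting step can be sketched as follows, assuming Laplace noise with scale $1/\gamma$ is added to each class's vote count before taking the arg max. The function name `noisy_vote`, the parameter name `gamma`, and the toy vote counts are illustrative choices; teacher training, student training, and the privacy accounting are omitted.

```python
import numpy as np

def noisy_vote(teacher_predictions, num_classes, gamma, rng):
    """Return the label with the highest Laplace-perturbed teacher vote count."""
    preds = np.asarray(teacher_predictions, dtype=int)
    # Count how many teachers voted for each class.
    votes = np.bincount(preds, minlength=num_classes).astype(float)
    # Perturb every count independently, then pick the noisy winner.
    noisy = votes + rng.laplace(loc=0.0, scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(noisy))

rng = np.random.default_rng(0)
# Toy ensemble of 250 teachers: 240 predict class 3, 10 predict class 1.
teacher_predictions = [3] * 240 + [1] * 10
label = noisy_vote(teacher_predictions, num_classes=10, gamma=1.0, rng=rng)
print(label)
```

When the teachers agree strongly, as in this toy ensemble, the noise almost never changes the winning label; when the vote is close, the noise makes the outcome less dependent on any single teacher and hence on any single disjoint dataset.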