How can we debug a logistic regression model in a federated learning setting when the model behaves unexpectedly (e.g., it rejects the loan applications of all high-income customers)? SQL-based training data debugging frameworks have proved effective at fixing such issues in the non-federated setting: given an unexpected query result over model predictions, the framework automatically removes label errors from the training data so that the unexpected behavior disappears in the retrained model. In this paper, we enable this powerful framework for federated learning. The key challenge is to develop a security protocol for federated debugging that is provably secure, efficient, and accurate. Achieving this goal requires seamlessly integrating techniques from multiple fields (databases, machine learning, and cybersecurity). We first propose FedRain, which extends Rain, the state-of-the-art SQL-based training data debugging framework, to the federated learning setting. We address several technical challenges to make FedRain work and analyze its security guarantee and time complexity. The analysis shows that FedRain falls short in terms of both efficiency and security. To overcome these limitations, we redesign our security protocol and propose Frog, a novel SQL-based training data debugging framework tailored for federated learning. Our theoretical analysis shows that Frog is more secure, more accurate, and more efficient than FedRain. We conduct extensive experiments on several real-world datasets and a case study. The experimental results are consistent with our theoretical analysis and validate the effectiveness of Frog in practice.
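To make the setup concrete, the following is a minimal sketch (not the paper's API) of the kind of SQL "complaint" over model predictions that drives training data debugging in systems like Rain, FedRain, and Frog. The table schema, complaint structure, and threshold values are illustrative assumptions.

```python
# Illustrative sketch of an SQL complaint over model predictions.
# All names (predictions table, complaint dict, thresholds) are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions (customer_id INTEGER, income REAL, approved INTEGER)"
)
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?, ?)",
    [(1, 150000.0, 0), (2, 165000.0, 0), (3, 40000.0, 1)],  # toy model outputs
)

# The unexpected aggregate query result: no high-income customer is approved.
(observed,) = conn.execute(
    "SELECT COUNT(*) FROM predictions WHERE income > 100000 AND approved = 1"
).fetchone()

# A complaint states how the query result should change; the debugging
# framework then searches for training-label errors whose removal (followed
# by retraining) moves the query result toward the expected value.
complaint = {"query": "high_income_approvals", "observed": observed, "expected": ">= 1"}
print(complaint)  # {'query': 'high_income_approvals', 'observed': 0, 'expected': '>= 1'}
```

In the federated setting, the additional difficulty addressed by FedRain and Frog is carrying out this search over training data that is partitioned across parties, without revealing any party's raw data.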