No Arabic abstract
The question of how government agencies can acquire actionable, useful information about legitimate but unknown targets without intruding upon the electronic activity of innocent parties is extremely important. We address this question by providing experimental evidence that actionable, useful information can indeed be obtained in a manner that preserves the privacy of innocent parties and that holds government agencies accountable. In particular, we present practical, privacy-preserving protocols for two operations that law-enforcement and intelligence agencies have used effectively: set intersection and contact chaining. Experiments with our protocols suggest that privacy-preserving contact chaining can perform a 3-hop privacy-preserving graph traversal producing 27,000 ciphertexts in under two minutes. These ciphertexts are usable in turn via privacy-preserving set intersection to pinpoint potential unknown targets within a body of 150,000 total ciphertexts within 10 minutes, without exposing personal information about non-targets.
In this paper, we study the problem of summation evaluation of secrets. The secrets are distributed over a network of nodes that form a ring graph. Privacy-preserving iterative protocols for computing the sum of the secrets are proposed, which are compatible with node join and leave situations. Theoretic bounds are derived regarding the utility and accuracy, and the proposed protocols are shown to comply with differential privacy requirements. Based on utility, accuracy and privacy, we also provide guidance on appropriate selections of random noise parameters. Additionally, a few numerical examples that demonstrate their effectiveness are provided.
In this paper, we present a general multiparty modeling paradigm with Privacy Preserving Principal Component Analysis (PPPCA) for horizontally partitioned data. PPPCA can accomplish multiparty cooperative execution of PCA under the premise of keeping plaintext data locally. We also propose implementations using two techniques, i.e., homomorphic encryption and secret sharing. The output of PPPCA can be sent directly to data consumer to build any machine learning models. We conduct experiments on three UCI benchmark datasets and a real-world fraud detection dataset. Results show that the accuracy of the model built upon PPPCA is the same as the model with PCA that is built based on centralized plaintext data.
The Domain Name System (DNS) was created to resolve the IP addresses of the web servers to easily remembered names. When it was initially created, security was not a major concern; nowadays, this lack of inherent security and trust has exposed the global DNS infrastructure to malicious actors. The passive DNS data collection process creates a database containing various DNS data elements, some of which are personal and need to be protected to preserve the privacy of the end users. To this end, we propose the use of distributed ledger technology. We use Hyperledger Fabric to create a permissioned blockchain, which only authorized entities can access. The proposed solution supports queries for storing and retrieving data from the blockchain ledger, allowing the use of the passive DNS database for further analysis, e.g. for the identification of malicious domain names. Additionally, it effectively protects the DNS personal data from unauthorized entities, including the administrators that can act as potential malicious insiders, and allows only the data owners to perform queries over these data. We evaluated our proposed solution by creating a proof-of-concept experimental setup that passively collects DNS data from a network and then uses the distributed ledger technology to store the data in an immutable ledger, thus providing a full historical overview of all the records.
Differential privacy (DP) and local differential privacy (LPD) are frameworks to protect sensitive information in data collections. They are both based on obfuscation. In DP the noise is added to the result of queries on the dataset, whereas in LPD the noise is added directly on the individual records, before being collected. The main advantage of LPD with respect to DP is that it does not need to assume a trusted third party. The main disadvantage is that the trade-off between privacy and utility is usually worse than in DP, and typically to retrieve reasonably good statistics from the locally sanitized data it is necessary to have a huge collection of them. In this paper, we focus on the problem of estimating counting queries from collections of noisy answers, and we propose a variant of LDP based on the addition of geometric noise. Our main result is that the geometric noise has a better statistical utility than other LPD mechanisms from the literature.
Releasing full data records is one of the most challenging problems in data privacy. On the one hand, many of the popular techniques such as data de-identification are problematic because of their dependence on the background knowledge of adversaries. On the other hand, rigorous methods such as the exponential mechanism for differential privacy are often computationally impractical to use for releasing high dimensional data or cannot preserve high utility of original data due to their extensive data perturbation. This paper presents a criterion called plausible deniability that provides a formal privacy guarantee, notably for releasing sensitive datasets: an output record can be released only if a certain amount of input records are indistinguishable, up to a privacy parameter. This notion does not depend on the background knowledge of an adversary. Also, it can efficiently be checked by privacy tests. We present mechanisms to generate synthetic datasets with similar statistical properties to the input data and the same format. We study this technique both theoretically and experimentally. A key theoretical result shows that, with proper randomization, the plausible deniability mechanism generates differentially private synthetic data. We demonstrate the efficiency of this generative technique on a large dataset; it is shown to preserve the utility of original data with respect to various statistical analysis and machine learning measures.