In Privacy Preserving Data Publishing, various privacy models have been developed for employing anonymization operations on sensitive individual-level datasets, in order to publish the data for public access while preserving the privacy of individuals in the dataset. However, there is always a trade-off between preserving privacy and data utility: the more changes we make to the confidential dataset to reduce disclosure risk, the more information the data loses and the less utility it preserves. The optimum privacy technique is the one that results in a dataset with minimum disclosure risk and maximum data utility. In this paper, we propose an improved suppression method, which reduces the disclosure risk and enhances the data utility by targeting the highest-risk records and keeping other records intact. We show the effectiveness of our approach through an experiment on a real-world confidential dataset.
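The idea of suppressing only the highest-risk records can be illustrated with a minimal sketch. The abstract above does not say how risk is measured, so this example assumes a simple stand-in: a record is high-risk when fewer than `k` records share its quasi-identifier values. The function name, the parameter `k`, and the mask symbol are hypothetical and are not taken from the paper.

```python
from collections import Counter

def suppress_high_risk(records, quasi_identifiers, k=2, mask="*"):
    """Mask quasi-identifier values only in records whose equivalence
    class (records sharing the same quasi-identifier values) has fewer
    than k members. All other records are returned unchanged.

    Assumption: risk is approximated by equivalence-class size; the
    paper's actual risk measure is not given in the abstract.
    """
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    class_sizes = Counter(key(r) for r in records)
    result = []
    for r in records:
        if class_sizes[key(r)] < k:
            # Copy the record so the input list is not mutated.
            r = {**r, **{q: mask for q in quasi_identifiers}}
        result.append(r)
    return result

records = [
    {"age": 30, "zip": "111", "disease": "flu"},
    {"age": 30, "zip": "111", "disease": "cold"},
    {"age": 45, "zip": "222", "disease": "asthma"},
]
published = suppress_high_risk(records, ["age", "zip"], k=2)
```

In this toy dataset only the third record is unique on `age` and `zip`, so it is the only one whose quasi-identifiers are masked; the first two records keep their full utility.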
LDP (Local Differential Privacy) has been widely studied to estimate statistics of personal data (e.g., the distribution underlying the data) while protecting users' privacy. Although LDP does not require a trusted third party, it regards all personal data as equally sensitive, which causes excessive obfuscation and hence a loss of utility. In this paper, we introduce the notion of ULDP (Utility-optimized LDP), which provides a privacy guarantee equivalent to LDP only for sensitive data. We first consider the setting where all users use the same obfuscation mechanism, and propose two mechanisms providing ULDP: utility-optimized randomized response and utility-optimized RAPPOR. We then consider the setting where the distinction between sensitive and non-sensitive data can differ from user to user. For this setting, we propose a personalized ULDP mechanism with semantic tags to estimate the distribution of personal data with high utility while keeping secret what is sensitive for each user. We show theoretically and experimentally that our mechanisms provide much higher utility than the existing LDP mechanisms when there is a lot of non-sensitive data. We also show that when most of the data are non-sensitive, our mechanisms even provide almost the same utility as non-private mechanisms in the low privacy regime.
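For context, the following is a minimal sketch of standard k-ary randomized response, the kind of baseline LDP mechanism that treats every value as equally sensitive. It is not the utility-optimized mechanism proposed in the abstract above, whose details are not given there. The function names and the example domain are hypothetical.

```python
import math
import random

def k_randomized_response(value, domain, epsilon, rng=random):
    """Report the true value with probability e^eps / (e^eps + k - 1),
    otherwise report one of the other k - 1 values uniformly at random.
    Assumes the domain has at least two values."""
    k = len(domain)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_true:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)

def estimate_distribution(reports, domain, epsilon):
    """Unbiased frequency estimate obtained by inverting the known
    perturbation probabilities p (keep) and q (flip to a given value)."""
    k = len(domain)
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = 1.0 / (math.exp(epsilon) + k - 1)
    return {v: (reports.count(v) / n - q) / (p - q) for v in domain}

rng = random.Random(0)
domain = ["a", "b", "c"]
true_values = ["a"] * 6000 + ["b"] * 3000 + ["c"] * 1000
reports = [k_randomized_response(v, domain, 1.0, rng) for v in true_values]
estimate = estimate_distribution(reports, domain, 1.0)
```

Because every value is perturbed with the same probabilities, non-sensitive values pay the same noise cost as sensitive ones, which is the utility loss the abstract attributes to plain LDP.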
Utility mining has emerged as an important and interesting topic owing to its wide application and considerable popularity. However, conventional utility mining methods have a bias toward items that have longer on-shelf time, as they have a greater chance to generate a high utility. To eliminate the bias, the problem of on-shelf utility mining (OSUM) is introduced. In this paper, we focus on the task of OSUM of sequence data, where the sequential database is divided into several partitions according to time periods and items are associated with utilities and several on-shelf time periods. To address the problem, we propose two methods, OSUM of sequence data (OSUMS) and OSUMS+, to extract on-shelf high-utility sequential patterns. For further efficiency, we also design several strategies to reduce the search space and avoid redundant calculation, using two upper bounds: time prefix extension utility (TPEU) and time reduced sequence utility (TRSU). In addition, two novel data structures are developed to facilitate the calculation of upper bounds and utilities. Extensive experimental results on real and synthetic datasets show that the two methods outperform the state-of-the-art algorithm. In conclusion, OSUMS may consume a large amount of memory and is unsuitable for cases with limited memory, while OSUMS+ has wider real-life applications owing to its high efficiency.
It is widely known that there is a lot of useful information hidden in big data, leading to a new saying that data is money. Thus, it is prevalent for individuals to mine crucial information for utilization in many real-world applications. Past studies have considered frequency. Unfortunately, doing so neglects other aspects, such as utility, interest, or risk. Thus, it is sensible to discover high-utility itemsets (HUIs) in transaction databases while utilizing not only the quantity but also the predefined utility. To find patterns that can represent the supporting transaction, a recent study was conducted to mine high utility-occupancy patterns whose contribution to the utility of the entire transaction is greater than a certain value. Moreover, in realistic applications, patterns may not exist in transactions with certainty but may instead be associated with an existence probability. In this paper, a novel algorithm, called High-Utility-Occupancy Pattern Mining in Uncertain databases (UHUOPM), is proposed. The patterns found by the algorithm are called Potential High Utility Occupancy Patterns (PHUOPs). This algorithm divides user preferences into three factors: support, probability, and utility occupancy. To reduce memory cost and time consumption and to prune the search space of the algorithm, the probability-utility-occupancy list (PUO-list) and the probability-frequency-utility table (PFU-table) are used, which assist in providing the downward closure property. Furthermore, an original tree structure, called the support count tree (SC-tree), is constructed as the search space of the algorithm. Finally, substantial experiments were conducted to evaluate the performance of the proposed UHUOPM algorithm on both real-life and synthetic datasets, particularly in terms of effectiveness and efficiency.
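The three factors named above (support, probability, and utility occupancy) can be illustrated with a minimal sketch. The abstract does not give formal definitions, so this example assumes common ones: utility occupancy is the pattern's share of a supporting transaction's total utility, averaged over supporting transactions, and the probability factor is the sum of the existence probabilities of those transactions. The function name and data layout are hypothetical, and the sketch does not use the PUO-list, PFU-table, or SC-tree structures.

```python
def pattern_measures(pattern, transactions, unit_utility):
    """Compute (support, expected_support, utility_occupancy) for a pattern.

    transactions: list of (items, probability) pairs, where items maps
    each item to its purchased quantity.
    Assumption: these definitions are illustrative, not taken from the
    paper; every transaction is assumed to have positive total utility.
    """
    pattern = set(pattern)
    support = 0
    expected_support = 0.0
    occupancy_sum = 0.0
    for items, prob in transactions:
        if not pattern.issubset(items):
            continue  # transaction does not contain the whole pattern
        total = sum(qty * unit_utility[i] for i, qty in items.items())
        part = sum(items[i] * unit_utility[i] for i in pattern)
        support += 1
        expected_support += prob
        occupancy_sum += part / total
    occupancy = occupancy_sum / support if support else 0.0
    return support, expected_support, occupancy

unit_utility = {"a": 2, "b": 1, "c": 5}
transactions = [
    ({"a": 1, "b": 2}, 0.9),
    ({"a": 2, "c": 1}, 0.5),
    ({"b": 1}, 1.0),
]
support, expected_support, occupancy = pattern_measures(
    {"a"}, transactions, unit_utility
)
```

Here the pattern `{"a"}` appears in the first two transactions, contributing 2 of 4 and 4 of 9 utility units respectively, so its occupancy is the mean of those two shares.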
As a tremendous amount of data is generated every day from human activity and from devices equipped with sensing capabilities, cloud computing emerges as a scalable and cost-effective platform to store and manage the data. While the benefits of cloud computing are numerous, security concerns arising when data and computation are outsourced to a third party still hinder the complete movement to the cloud. In this paper, we focus on the problem of data privacy on the cloud, particularly on access controls over stream data. The nature of stream data and the complexity of sharing data make access control a more challenging issue than in traditional archival databases. We present Streamforce, a system allowing data owners to securely outsource their data to the cloud. The owner specifies fine-grained policies which are enforced by the cloud. The latter performs most of the heavy computations, while learning nothing about the data. To this end, we employ a number of encryption schemes, including deterministic encryption, proxy-based attribute-based encryption and sliding-window encryption. In Streamforce, access control policies are modeled as secure continuous queries, which entails minimal changes to existing stream processing engines and allows for easy expression of a wide range of policies. In particular, Streamforce comes with a number of secure query operators including Map, Filter, Join and Aggregate. Finally, we implement Streamforce over an open source stream processing engine (Esper) and evaluate its performance on a cloud platform. The results demonstrate practical performance for many real-world applications, and although the security overhead is visible, Streamforce is highly scalable.
A tremendous amount of individual-level data is generated each day, of use to marketing, decision makers, and machine learning applications. This data often contains private and sensitive information about individuals, which can be disclosed by adversaries. An adversary can recognize the underlying individual's identity for a data record by looking at the values of quasi-identifier attributes, known as identity disclosure, or can uncover sensitive information about an individual through attribute disclosure. In Statistical Disclosure Control, multiple disclosure risk measures have been proposed. These share two drawbacks: they do not consider identity and attribute disclosure concurrently in the risk measure, and they make restrictive assumptions about an adversary's knowledge by assuming certain attributes are quasi-identifiers and there is a clear boundary between quasi-identifiers and sensitive information. In this paper, we present a novel disclosure risk measure that addresses these limitations by presenting a single combined metric of identity and attribute disclosure risk, and by providing flexibility in modeling the adversary's knowledge. We have developed an efficient algorithm for computing the proposed risk measure and evaluated the feasibility and performance of our approach on a real-world data set from the domain of social work.