No Arabic abstract
Mobile apps and location-based services generate large amounts of location data that can benefit research on traffic optimization, context-aware notifications and public health (e.g., spread of contagious diseases). To preserve individual privacy, one must first sanitize location data, which is commonly done using the powerful differential privacy (DP) concept. However, existing solutions fall short of properly capturing density patterns and correlations that are intrinsic to spatial data, and as a result yield poor accuracy. We propose a machine-learning based approach for answering statistical queries on location data with DP guarantees. We focus on countering the main source of error that plagues existing approaches (namely, uniformity error), and we design a neural database system that models spatial datasets such that important density and correlation features present in the data are preserved, even when DP-compliant noise is added. We employ a set of neural networks that learn from diverse regions of the dataset and at varying granularities, leading to superior accuracy. We also devise a framework for effective system parameter tuning on top of public data, which helps practitioners set important system parameters without having to expend scarce privacy budget. Extensive experimental results on real datasets with heterogeneous characteristics show that our proposed approach significantly outperforms the state of the art.
A corpus of recent work has revealed that the learned index can improve query performance while reducing the storage overhead. It potentially offers an opportunity to address the spatial query processing challenges caused by the surge in location-based services. Although several learned indexes have been proposed to process spatial data, the main idea behind these approaches is to utilize the existing one-dimensional learned models, which requires either converting the spatial data into one-dimensional data or applying the learned model on individual dimensions separately. As a result, these approaches cannot fully utilize or take advantage of the information regarding the spatial distribution of the original spatial data. To this end, in this paper, we exploit it by using the spatial (multi-dimensional) interpolation function as the learned model, which can be directly employed on the spatial data. Specifically, we design an efficient SPatial inteRpolation functIon based Grid index (SPRIG) to process the range and kNN queries. Detailed experiments are conducted on real-world datasets, and the results indicate that our proposed learned index can significantly improve the performance in comparison with the traditional spatial indexes and a state-of-the-art multi-dimensional learned index.
The adoption of differential privacy is growing but the complexity of designing private, efficient and accurate algorithms is still high. We propose a novel programming framework and system, Ektelo, for implementing both existing and new privacy algorithms. For the task of answering linear counting queries, we show that nearly all existing algorithms can be composed from operators, each conforming to one of a small number of operator classes. While past programming frameworks have helped to ensure the privacy of programs, the novelty of our framework is its significant support for authoring accurate and efficient (as well as private) programs. After describing the design and architecture of the Ektelo system, we show that Ektelo is expressive, allows for safer implementations through code reuse, and that it allows both privacy novices and experts to easily design algorithms. We demonstrate the use of Ektelo by designing several new state-of-the-art algorithms.
Outlier detection plays a significant role in various real world applications such as intrusion, malfunction, and fraud detection. Traditionally, outlier detection techniques are applied to find outliers in the context of the whole dataset. However, this practice neglects contextual outliers, that are not outliers in the whole dataset but in some specific neighborhoods. Contextual outliers are particularly important in data exploration and targeted anomaly explanation and diagnosis. In these scenarios, the data owner computes the following information: i) The attributes that contribute to the abnormality of an outlier (metric), ii) Contextual description of the outliers neighborhoods (context), and iii) The utility score of the context, e.g. its strength in showing the outliers significance, or in relation to a particular explanation for the outlier. However, revealing the outliers context leaks information about the other individuals in the population as well, violating their privacy. We address the issue of population privacy violations in this paper, and propose a solution for the two main challenges. In this setting, the data owner is required to release a valid context for the queried record, i.e. a context in which the record is an outlier. Hence, the first major challenge is that the privacy technique must preserve the validity of the context for each record. We propose techniques to protect the privacy of individuals through a relaxed notion of differential privacy to satisfy this requirement. The second major challenge is applying the proposed techniques efficiently, as they impose intensive computation to the base algorithm. To overcome this challenge, we propose a graph structure to map the contexts to, and introduce differentially private graph search algorithms as efficient solutions for the computation problem caused by differential privacy techniques.
Graphs are widely used to model data in many application domains. Thanks to the wide spread use of GPS-enabled devices, many applications assign a spatial attribute to graph vertices (e.g., geo-tagged social media). Users may issue a Reachability Query with Spatial Range Predicate (abbr. RangeReach). RangeReach finds whether an input vertex can reach any spatial vertex that lies within an input spatial range. An example of a RangeReach query is: Given a social graph, find whether Alice can reach any of the venues located within the geographical area of Arizona State University. The paper proposes GeoReach an approach that adds spatial data awareness to a graph database management system (GDBMS). GeoReach allows efficient execution of RangeReach queries, yet without compromising a lot on the overall system scalability (measured in terms of storage size and initialization/maintenance time). To achieve that, GeoReach is equipped with a light-weight data structure, namely SPA-Graph, that augments the underlying graph data with spatial indexing directories. When a RangeReach query is issued, the system employs a pruned-graph traversal approach. Experiments based on real system implementation inside Neo4j proves that GEOREACH exhibits up to two orders of magnitude better query response time and up to four times less storage than the state-of-the-art spatial and reachability indexing approaches.
Organizations are increasingly relying on data to support decisions. When data contains private and sensitive information, the data owner often desires to publish a synthetic database instance that is similarly useful as the true data, while ensuring the privacy of individual data records. Existing differentially private data synthesis methods aim to generate useful data based on applications, but they fail in keeping one of the most fundamental data properties of the structured data -- the underlying correlations and dependencies among tuples and attributes (i.e., the structure of the data). This structure is often expressed as integrity and schema constraints, or with a probabilistic generative process. As a result, the synthesized data is not useful for any downstream tasks that require this structure to be preserved. This work presents Kamino, a data synthesis system to ensure differential privacy and to preserve the structure and correlations present in the original dataset. Kamino takes as input of a database instance, along with its schema (including integrity constraints), and produces a synthetic database instance with differential privacy and structure preservation guarantees. We empirically show that while preserving the structure of the data, Kamino achieves comparable and even better usefulness in applications of training classification models and answering marginal queries than the state-of-the-art methods of differentially private data synthesis.