A streaming graph is a graph formed by a sequence of incoming edges with timestamps. Unlike static graphs, a streaming graph is highly dynamic and time dependent. In the real world, high-volume and high-velocity streaming graphs such as Internet traffic data, social network communication data, and financial transfer data pose challenges to classic graph data structures. We present a new data structure, double orthogonal list in hash table (Dolha), a high-speed and memory-efficient graph structure for streaming graphs. Dolha has constant time cost per edge and near-linear space cost, so billions of edges can be kept in memory and each incoming edge can be processed in nanoseconds. Dolha also answers neighborhood queries in linear time, which allows it to support most graph algorithms without extra cost. We also present a persistent structure based on Dolha that handles sliding window updates and time-related queries.
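The abstract does not spell out the layout of the structure; the following is a minimal Python sketch of how a "double orthogonal list in a hash table" could look under our reading (one hash table over edges, with each entry threaded onto a per-source out-edge list and a per-target in-edge list), giving O(1) expected insertion and degree-linear neighborhood queries. All class and method names are illustrative, not the paper's.

# Illustrative sketch (not the paper's code): a hash table of edges in which
# each entry is threaded onto two linked lists, one per source vertex
# (out-edges) and one per target vertex (in-edges).

class EdgeEntry:
    __slots__ = ("src", "dst", "timestamp", "next_out", "next_in")
    def __init__(self, src, dst, timestamp):
        self.src, self.dst, self.timestamp = src, dst, timestamp
        self.next_out = None   # next edge with the same source
        self.next_in = None    # next edge with the same target

class DolhaSketch:
    def __init__(self):
        self.edges = {}        # (src, dst) -> EdgeEntry, O(1) expected lookup
        self.out_head = {}     # src -> head of its out-edge list
        self.in_head = {}      # dst -> head of its in-edge list

    def insert(self, src, dst, timestamp):
        """O(1) expected insertion of a streaming edge."""
        e = EdgeEntry(src, dst, timestamp)
        self.edges[(src, dst)] = e
        e.next_out = self.out_head.get(src)
        self.out_head[src] = e
        e.next_in = self.in_head.get(dst)
        self.in_head[dst] = e

    def successors(self, src):
        """Neighborhood query: linear in the out-degree of src."""
        e = self.out_head.get(src)
        while e is not None:
            yield e.dst, e.timestamp
            e = e.next_out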
In the presence of dynamic insertions and deletions into a partially reconfigurable FPGA, fragmentation is unavoidable. This poses the challenge of developing efficient approaches to dynamic defragmentation and reallocation. One key aspect is to develop efficient algorithms and data structures that exploit the two-dimensional geometry of a chip, instead of just one dimension. We propose a new method for this task, based on the fractal structure of a quadtree, which allows dynamic segmentation of the chip area along with dynamic adjustment of the necessary communication infrastructure. We describe a number of algorithmic aspects and present different solutions. We also provide basic simulations indicating that the theoretical worst-case bound may be pessimistic.
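The abstract only names the quadtree as the organizing structure; below is a minimal, hypothetical Python sketch of buddy-style quadrant allocation over a square chip area, meant to illustrate dynamic segmentation in two dimensions rather than the paper's actual algorithms (defragmentation, reallocation and the communication infrastructure are not modeled).

# Illustrative quadtree allocator over a square chip area (side a power of two):
# a request of side s is served by descending into quadrants until the node's
# side matches the smallest power of two >= s.

class QuadNode:
    def __init__(self, x, y, side):
        self.x, self.y, self.side = x, y, side
        self.used = False        # fully occupied by one module
        self.children = None     # four quadrants once split

    def allocate(self, s):
        """Return (x, y, side) of a free block of side >= s, or None."""
        if self.used or s > self.side:
            return None
        if self.children is None:
            if s > self.side // 2:          # this node is the right size
                self.used = True
                return (self.x, self.y, self.side)
            half = self.side // 2            # otherwise split into quadrants
            self.children = [QuadNode(self.x + dx, self.y + dy, half)
                             for dx in (0, half) for dy in (0, half)]
        for child in self.children:
            block = child.allocate(s)
            if block is not None:
                return block
        return None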
Data streaming relies on continuous queries to process unbounded streams of data in real time. It is typically demanding in computation capacity, given that the relevant applications involve very large volumes of data. Data structures act as articulation points and maintain the state of data streaming operators, potentially supporting high parallelism and balancing the work between them. Prompted by this fact, in this work we study and analyze the parallelization needs of these articulation points, focusing on the problem of streaming multiway aggregation, where large data volumes are received from multiple input streams. The analysis of the parallelization needs, as well as of the use and limitations of existing aggregate designs and their data structures, leads us to identify the need for proper shared objects that can achieve low-latency and high-throughput multiway aggregation. We present the requirements of such objects as abstract data types and we provide efficient lock-free linearizable algorithmic implementations of them, along with new multiway aggregate algorithmic designs that leverage them, supporting both deterministic order-sensitive and order-insensitive aggregate functions. Furthermore, we point out future directions that open up through these contributions. The paper includes an extensive experimental study, based on a variety of aggregation continuous queries on two large datasets extracted from SoundCloud, a music social network, and from a Smart Grid network. In all the experiments, the proposed data structures and the enhanced aggregate operators improved processing performance significantly, by up to one order of magnitude in both throughput and latency, over commonly used queue-based techniques.
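The paper specifies its shared objects as abstract data types and then gives lock-free linearizable implementations; the sequential Python sketch below only illustrates one plausible interface of such an object (insert per input stream, then emit per-window aggregates once a watermark has passed), leaving out the lock-free machinery that is the paper's actual contribution. All names and the watermark convention are ours.

# A plain, sequential illustration of a multiway aggregation ADT: tuples from
# several timestamped input streams are buffered and aggregated per window.

import heapq
from collections import defaultdict

class MultiwayAggregator:
    def __init__(self, window_size, aggregate=sum):
        self.window_size = window_size
        self.aggregate = aggregate
        self.buffers = defaultdict(list)   # stream id -> min-heap of (ts, value)

    def insert(self, stream_id, timestamp, value):
        heapq.heappush(self.buffers[stream_id], (timestamp, value))

    def ready_windows(self, watermark):
        """Emit aggregates for all windows that close before the watermark
        (e.g. the smallest latest timestamp seen across the input streams)."""
        cutoff = watermark // self.window_size      # first window not yet closed
        per_window = defaultdict(list)
        for heap in self.buffers.values():
            while heap and heap[0][0] // self.window_size < cutoff:
                ts, value = heapq.heappop(heap)
                per_window[ts // self.window_size].append(value)
        return {w: self.aggregate(vs) for w, vs in sorted(per_window.items())}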
There has been a significant amount of work in the literature proposing semantic relaxation of concurrent data structures for improving scalability and performance. By relaxing the semantics of a data structure, a bigger design space is unveiled, one that allows weaker synchronization and more useful parallelism. Investigating new data structure designs that can trade semantics for better performance in a monotonic way is a major challenge in the area. We address this challenge algorithmically in this paper. We present an efficient, lock-free, concurrent data structure design framework for out-of-order semantic relaxation. Our framework introduces a new two-dimensional algorithmic design that uses multiple instances of a given data structure. The first dimension of our design is the number of data structure instances operations are spread over, in order to benefit from parallelism through disjoint memory accesses. The second dimension is the number of consecutive operations that try to use the same data structure instance, in order to benefit from data locality. Our design can flexibly explore this two-dimensional space to monotonically relax concurrent data structure semantics for better throughput performance within a tight deterministic relaxation bound, as we prove in the paper. We show how our framework can instantiate lock-free out-of-order queues, stacks, counters and deques. We provide implementations of these relaxed data structures and evaluate their performance and behaviour on two parallel architectures. Experimental evaluation shows that our two-dimensional data structures significantly outperform the respective previously proposed ones in scalability and throughput. Moreover, their throughput increases monotonically as relaxation increases.
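As a rough, sequential illustration of the two dimensions (number of instances, and consecutive operations per instance), the Python sketch below spreads queue operations over several sub-queues; it omits the lock-free synchronization and the precise relaxation-bound accounting that the paper provides, and all names and parameters are ours.

# Sequential sketch of the two-dimensional relaxation idea: operations are
# spread over `width` sub-queue instances (first dimension), and a caller keeps
# using the same instance for up to `depth` consecutive operations (second
# dimension) before moving on; the two parameters jointly bound how far an
# element can be dequeued out of FIFO order.

from collections import deque

class TwoDimRelaxedQueue:
    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.queues = [deque() for _ in range(width)]
        self.index, self.count = 0, 0      # current instance, ops done on it

    def _advance(self):
        self.count += 1
        if self.count == self.depth:       # locality budget exhausted
            self.index = (self.index + 1) % self.width
            self.count = 0

    def enqueue(self, item):
        self.queues[self.index].append(item)
        self._advance()

    def dequeue(self):
        for offset in range(self.width):   # skip over empty instances
            q = self.queues[(self.index + offset) % self.width]
            if q:
                item = q.popleft()
                self._advance()
                return item
        return None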
We present new and improved data structures that answer exact node-to-node distance queries in planar graphs. Such data structures are also known as distance oracles. For any directed planar graph on $n$ nodes with non-negative lengths we obtain the following: * Given a desired space allocation $S \in [n \lg\lg n, n^2]$, we show how to construct in $\tilde O(S)$ time a data structure of size $O(S)$ that answers distance queries in $\tilde O(n/\sqrt{S})$ time per query. As a consequence, we obtain an improvement over the fastest algorithm for $k$-many distances in planar graphs whenever $k \in [\sqrt{n}, n)$. * We provide a linear-space exact distance oracle for planar graphs with query time $O(n^{1/2+\epsilon})$ for any constant $\epsilon > 0$. This is the first such data structure with provable sublinear query time. * For edge lengths at least one, we provide an exact distance oracle of space $\tilde O(n)$ such that for any pair of nodes at distance $D$ the query time is $\tilde O(\min\{D, \sqrt{n}\})$. Comparable query performance had been observed experimentally but has never been explained theoretically. Our data structures are based on the following new tool: given a non-self-crossing cycle $C$ with $c = O(\sqrt{n})$ nodes, we can preprocess $G$ in $\tilde O(n)$ time to produce a data structure of size $O(n \lg\lg c)$ that can answer the following queries in $\tilde O(c)$ time: for a query node $u$, output the distance from $u$ to all the nodes of $C$. This data structure builds on and extends a related data structure of Klein (SODA05), which reports distances to the boundary of a face, rather than a cycle. The best distance oracles for planar graphs until the current work are due to Cabello (SODA06), Djidjev (WG96), and Fakcharoenphol and Rao (FOCS01). For $\sigma \in (1, 4/3)$ and space $S = n^\sigma$, we essentially improve the query time from $n^2/S$ to $\sqrt{n^2/S}$.
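The following Python sketch only illustrates the separator-based query pattern that such oracles build on, i.e. combining distances through a cycle $C$ via $\min_{w \in C} d(u,w) + d(w,v)$ when every $u$-to-$v$ path crosses $C$: it substitutes plain Dijkstra runs for the paper's $\tilde O(c)$-time "distances from $u$ to all of $C$" structure, so it is a conceptual stand-in, not the proposed construction, and all function names are ours.

# Naive stand-in for the separator-based query pattern of planar distance
# oracles; the real data structures replace the repeated Dijkstra runs.

import heapq

def dijkstra(graph, source):
    """graph: {node: [(neighbor, length), ...]} with non-negative lengths."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, ()):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def oracle_query(graph, cycle_nodes, u, v):
    """Distance from u to v, assuming every u-to-v path crosses the cycle."""
    dist_from_u = dijkstra(graph, u)          # the paper's tool: ~O(c) per query
    return min(dist_from_u.get(w, float("inf"))
               + dijkstra(graph, w).get(v, float("inf"))
               for w in cycle_nodes)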
Given a stream of graph edges from a dynamic graph, how can we assign anomaly scores to edges and subgraphs in an online manner, for the purpose of detecting unusual behavior, using constant time and memory? For example, in intrusion detection, existing work seeks to detect either anomalous edges or anomalous subgraphs, but not both. In this paper, we first extend the count-min sketch data structure to a higher-order sketch. This higher-order sketch has the useful property of preserving the dense subgraph structure (dense subgraphs in the input turn into dense submatrices in the data structure). We then propose four online algorithms that utilize this enhanced data structure, which (a) detect both edge and graph anomalies; (b) process each edge and graph in constant memory, with constant update time per newly arriving edge; and (c) outperform state-of-the-art baselines on four real-world datasets. Our method is the first streaming approach that incorporates dense subgraph search to detect graph anomalies in constant memory and time.
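As a rough illustration of the higher-order sketch idea (hashing both endpoints of an edge so that each count-min row becomes a small count matrix, in which dense subgraphs appear as dense submatrices), here is a hypothetical Python sketch; the parameters, names and the anomaly-scoring algorithms built on top of the sketch are not from the paper.

# Illustrative higher-order sketch: r independent copies, each a small w x w
# count matrix indexed by a pair of hashes (source bucket, destination bucket).
# Updates are constant time per arriving edge; dense subgraphs in the stream
# concentrate their mass in dense submatrices.

import random

class HigherOrderSketch:
    def __init__(self, rows=3, width=256, seed=0):
        rng = random.Random(seed)
        self.seeds = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(rows)]
        self.width = width
        self.counts = [[[0] * width for _ in range(width)] for _ in range(rows)]

    def _cell(self, seed, node):
        return hash((seed, node)) % self.width

    def add_edge(self, u, v, weight=1):
        """Constant-time update for a newly arriving edge (u, v)."""
        for (su, sv), matrix in zip(self.seeds, self.counts):
            matrix[self._cell(su, u)][self._cell(sv, v)] += weight

    def estimate(self, u, v):
        """Count-min style overestimate of the total weight of edge (u, v)."""
        return min(matrix[self._cell(su, u)][self._cell(sv, v)]
                   for (su, sv), matrix in zip(self.seeds, self.counts))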