No Arabic abstract
We present an efficient clustering algorithm applicable to one-dimensional data such as e.g. a series of timestamps. Given an expected frequency $Delta T^{-1}$, we introduce an $mathcal{O}(N)$-efficient method of characterizing $N$ events represented by an ordered series of timestamps $t_1,t_2,dots,t_N$. In practice, the method proves useful to e.g. identify time intervals of missing data or to locate isolated events. Moreover, we define measures to quantify a series of events by varying $Delta T$ to e.g. determine the quality of an Internet of Things service.
Given a similarity graph between items, correlation clustering (CC) groups similar items together and dissimilar ones apart. One of the most popular CC algorithms is KwikCluster: an algorithm that serially clusters neighborhoods of vertices, and obtains a 3-approximation ratio. Unfortunately, KwikCluster in practice requires a large number of clustering rounds, a potential bottleneck for large graphs. We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds and achieve nearly linear speedups, provably. C4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. ClusterWild! is a coordination free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the 3-approximation ratio. We provide extensive experimental results for both algorithms, where we outperform the state of the art, both in terms of clustering accuracy and running time. We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup.
In this paper, we consider the Byzantine reliable broadcast problem on authenticated and partially connected networks. The state-of-the-art method to solve this problem consists in combining two algorithms from the literature. Handling asynchrony and faulty senders is typically done thanks to Gabriel Brachas authenticated double-echo broadcast protocol, which assumes an asynchronous fully connected network. Danny Dolevs algorithm can then be used to provide reliable communications between processes in the global fault model, where up to f processes among N can be faulty in a communication network that is at least 2f+1-connected. Following recent works that showed that Dolevs protocol can be made more practical thanks to several optimizations, we show that the state-of-the-art methods to solve our problem can be optimized thanks to layer-specific and cross-layer optimizations. Our simulations with the Omnet++ network simulator show that these optimizations can be efficiently combined to decrease the total amount of information transmitted or the protocols latency (e.g., respectively, -25% and -50% with a 16B payload, N=31 and f=4) compared to the state-of-the-art combination of Brachas and Dolevs protocols.
Synthetic medical data which preserves privacy while maintaining utility can be used as an alternative to real medical data, which has privacy costs and resource constraints associated with it. At present, most models focus on generating cross-sectional health data which is not necessarily representative of real data. In reality, medical data is longitudinal in nature, with a single patient having multiple health events, non-uniformly distributed throughout their lifetime. These events are influenced by patient covariates such as comorbidities, age group, gender etc. as well as external temporal effects (e.g. flu season). While there exist seminal methods to model time series data, it becomes increasingly challenging to extend these methods to medical event time series data. Due to the complexity of the real data, in which each patient visit is an event, we transform the data by using summary statistics to characterize the events for a fixed set of time intervals, to facilitate analysis and interpretability. We then train a generative adversarial network to generate synthetic data. We demonstrate this approach by generating human sleep patterns, from a publicly available dataset. We empirically evaluate the generated data and show close univariate resemblance between synthetic and real data. However, we also demonstrate how stratification by covariates is required to gain a deeper understanding of synthetic data quality.
We propose an univesal scheme to design loop-free and super-stabilizing protocols for constructing spanning trees optimizing any tree metrics (not only those that are isomorphic to a shortest path tree). Our scheme combines a novel super-stabilizing loop-free BFS with an existing self-stabilizing spanning tree that optimizes a given metric. The composition result preserves the best properties of both worlds: super-stabilization, loop-freedom, and optimization of the original metric without any stabilization time penalty. As case study we apply our composition mechanism to two well known metric-dependent spanning trees: the maximum-flow tree and the minimum degree spanning tree.
We present a novel self-stabilizing algorithm for minimum spanning tree (MST) construction. The space complexity of our solution is $O(log^2n)$ bits and it converges in $O(n^2)$ rounds. Thus, this algorithm improves the convergence time of all previously known self-stabilizing asynchronous MST algorithms by a multiplicative factor $Theta(n)$, to the price of increasing the best known space complexity by a factor $O(log n)$. The main ingredient used in our algorithm is the design, for the first time in self-stabilizing settings, of a labeling scheme for computing the nearest common ancestor with only $O(log^2n)$ bits.