Finding or monitoring subgraph instances that are isomorphic to a given pattern graph in a data graph is a fundamental query operation in many graph analytic applications, such as network motif mining and fraud detection. The state-of-the-art distributed methods are communication-inefficient: they have to shuffle partial matching results during the distributed multiway join, and these partial results may be much larger than the data graph itself. To overcome this drawback, we develop the Batch-BENU framework (B-BENU) for distributed subgraph enumeration. B-BENU executes a group of local search tasks in parallel. Each task enumerates subgraphs around a vertex in the data graph, guided by a backtracking-based execution plan. B-BENU does not shuffle any partial matching result. Instead, it stores the data graph in a distributed database, and each task queries adjacency sets of the data graph on demand. To support dynamic data graphs, we propose the concept of incremental pattern graphs and turn continuous subgraph enumeration into enumerating incremental pattern graphs at each time step. We develop the Streaming-BENU framework (S-BENU) to enumerate their matches efficiently. We implement B-BENU and S-BENU with local database cache and task splitting techniques. Extensive experiments show that B-BENU and S-BENU can scale to big data graphs and complex pattern graphs. They outperform the state-of-the-art methods by up to one and two orders of magnitude, respectively.
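As an illustration of the local search task described in this abstract, the following is a minimal Python sketch of backtracking-based enumeration that fetches adjacency sets on demand. It is not the B-BENU implementation. The names `enumerate_from`, `get_adj`, `DATA_ADJ`, and `TRIANGLE` are hypothetical, an in-memory dictionary stands in for the distributed database, and the pattern vertices are assumed to be numbered so that every vertex after the first has at least one lower-numbered neighbor.

```python
from typing import Callable, Dict, Iterator, List, Set, Tuple

# Hypothetical stand-in for the distributed database: vertex -> adjacency set.
DATA_ADJ: Dict[int, Set[int]] = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3},
}

# Triangle pattern: pattern vertex i -> list of its pattern neighbors.
TRIANGLE: List[List[int]] = [[1, 2], [0, 2], [0, 1]]


def get_adj(v: int) -> Set[int]:
    """Fetch one adjacency set on demand (here: a plain dictionary lookup)."""
    return DATA_ADJ[v]


def enumerate_from(
    start: int,
    pattern_adj: List[List[int]],
    fetch: Callable[[int], Set[int]],
) -> Iterator[Tuple[int, ...]]:
    """Yield every match that maps pattern vertex 0 to data vertex `start`.

    Assumes each pattern vertex i > 0 has a neighbor j < i, so the
    candidate set is always an intersection of at least one adjacency set.
    """
    k = len(pattern_adj)

    def extend(mapping: List[int]) -> Iterator[Tuple[int, ...]]:
        i = len(mapping)  # next pattern vertex to match
        if i == k:
            yield tuple(mapping)
            return
        # Data vertices already matched to pattern neighbors of i.
        anchors = [mapping[j] for j in pattern_adj[i] if j < i]
        # Candidates must be adjacent to all of them.
        candidates = set.intersection(*(fetch(u) for u in anchors))
        for c in sorted(candidates):
            if c not in mapping:  # keep the mapping injective
                mapping.append(c)
                yield from extend(mapping)
                mapping.pop()  # backtrack

    yield from extend([start])


def enumerate_all(pattern_adj: List[List[int]]) -> List[Tuple[int, ...]]:
    """Run one independent local search task per data vertex."""
    results: List[Tuple[int, ...]] = []
    for v in sorted(DATA_ADJ):
        results.extend(enumerate_from(v, pattern_adj, get_adj))
    return results
```

Each call to `enumerate_from` is independent of the others, so the tasks could run in parallel without exchanging partial matches. The sketch reports every embedding, so each triangle appears once per automorphism of the pattern. Avoiding such duplicates would need an extra symmetry-breaking condition, which is omitted here.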
Fine-tuning distributed systems is considered a craft that relies on intuition and experience. It becomes even more challenging when the systems need to react in near real time, as streaming engines must do to maintain pre-agreed service quality metrics. In this article, we present an automated approach that builds on a combination of supervised and reinforcement learning methods to recommend the most appropriate lever configurations based on previous load. With this approach, streaming engines can be tuned automatically, without requiring a human to determine the right configurations and the proper time to deploy them. This opens the door to new configurations that are not applied today because the complexity of managing these systems has surpassed the abilities of human experts. We show how reinforcement learning systems can find substantially better configurations in less time than their human counterparts and adapt to changing workloads.
In this paper, we develop RCC, the first unified and comprehensive RDMA-enabled distributed transaction processing framework supporting six serializable concurrency control protocols: not only the classical protocols NOWAIT, WAITDIE, and OCC, but also the more advanced MVCC and SUNDIAL, and even CALVIN, the deterministic concurrency control protocol. Our goal is to compare the protocols without bias in a common execution environment in which the concurrency control protocol is the only changeable component. We focus on correct and efficient implementation using key techniques such as co-routines, outstanding requests, and doorbell batching, with both two-sided and one-sided communication primitives. RCC yields deep insights that cannot be obtained from any existing system. Most importantly, we obtain execution-stage latency breakdowns with one-sided and two-sided primitives for each protocol, and we analyze them to develop more efficient hybrid implementations. Our results show that three hybrid designs outperform both the one-sided and the two-sided implementations by up to 17.8%. We believe that RCC is a significant advance over the state of the art: it can both provide performance insights and serve as a common infrastructure for fast prototyping of new implementations.
In fifth-generation (5G) networks and beyond, communication latency and network bandwidth will no longer be bottlenecks for mobile users. Thus, almost every mobile device can participate in distributed learning, and the availability issue of distributed learning can be eliminated. However, model safety will become a challenge, because a distributed learning system is prone to Byzantine attacks while model parameters are updated and gradients are aggregated among multiple learning participants. Therefore, to provide Byzantine resilience for distributed learning in the 5G era, this article proposes PIRATE, a secure computing framework based on the sharding technique of blockchain. A case study shows how PIRATE contributes to distributed learning. Finally, we envision some open issues and challenges based on the proposed Byzantine-resilient learning framework.
A Range-Skyline Query (RSQ) combines a range query with a skyline query. It is a practical query type in multi-criteria decision services: it can include both spatial and non-spatial information, and it makes the results more useful than those of a plain skyline search when location matters. The Continuous Range-Skyline Query (CRSQ) extends RSQ so that the system continuously reports the skyline results for a query within a given search range. This work focuses on RSQ and CRSQ within a specific range in Internet of Mobile Things (IoMT) applications. Many server-client approaches for CRSQ have been proposed, but they are sensitive to the number of moving objects. We propose an effective, non-centralized approach, the Distributed Continuous Range-Skyline Query process (DCRSQ process), to support RSQ and CRSQ in mobile environments. By taking mobility into account, the proposed approach can predict the time at which an object falls within the query range and can ignore more irrelevant information when deriving the results, thus reducing computation overhead. We analyze the cost of the DCRSQ process and validate it with extensive simulated experiments. The results show that the DCRSQ process outperforms existing approaches in different scenarios and aspects.
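To make the combination of a range query and a skyline query concrete, here is a minimal centralized Python sketch. It is not the DCRSQ process. It assumes, purely for illustration, that the dominance test is applied to the distance from the query point together with the non-spatial attributes, and that smaller values are better in every dimension. The names `range_skyline`, `dominates`, and `OBJECTS` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (object id, x, y, non-spatial attributes); smaller attribute values are
# assumed to be better. Sample data is invented for illustration only.
Obj = Tuple[str, float, float, Tuple[float, ...]]

OBJECTS: List[Obj] = [
    ("a", 1.0, 0.0, (50.0,)),
    ("b", 2.0, 0.0, (30.0,)),
    ("c", 3.0, 0.0, (60.0,)),
    ("d", 10.0, 0.0, (1.0,)),
]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(
        x < y for x, y in zip(a, b)
    )


def range_skyline(
    objects: Sequence[Obj], qx: float, qy: float, radius: float
) -> List[str]:
    """Return ids of in-range objects not dominated by another in-range object."""
    # Step 1 (range query): keep objects within `radius` of the query point.
    in_range = []
    for oid, x, y, attrs in objects:
        dist = math.hypot(x - qx, y - qy)
        if dist <= radius:
            # Assumption: distance is one of the skyline dimensions.
            in_range.append((oid, (dist, *attrs)))
    # Step 2 (skyline query): drop every object dominated by another one.
    return [
        oid
        for oid, vec in in_range
        if not any(dominates(other, vec) for _, other in in_range)
    ]
```

With the sample data and a query at the origin, a radius of 5 excludes `d` before the skyline step, while a radius of 20 lets `d` enter the skyline because of its low attribute value. This quadratic-time version recomputes the result from scratch on each call. A continuous variant would instead update the result as objects move.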
Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing (HPC) techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems (SWfMS) and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.