
As random walk (RW) is a powerful tool in many graph processing, mining and learning applications, this paper proposes an efficient in-memory random walk engine named ThunderRW. Whereas existing parallel systems focus on improving the performance of a single graph operation, ThunderRW supports massively parallel random walks. The core design of ThunderRW is motivated by our profiling results: common RW algorithms have as many as 73.1% of CPU pipeline slots stalled due to irregular memory access, significantly more memory stalls than conventional graph workloads such as BFS and SSSP suffer. To improve memory efficiency, we first design a generic step-centric programming model named Gather-Move-Update to abstract different RW algorithms. Based on the programming model, we develop the step interleaving technique to hide memory access latency by switching between the executions of different random walk queries. In our experiments, we use four representative RW algorithms, namely PPR, DeepWalk, Node2Vec and MetaPath, to demonstrate the efficiency and programming flexibility of ThunderRW. Experimental results show that ThunderRW outperforms state-of-the-art approaches by an order of magnitude, and the step interleaving technique reduces the CPU pipeline stall from 73.1% to 15.0%.
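The step-centric idea can be illustrated with a minimal sketch that advances many walks one step at a time in round-robin order instead of finishing one walk before starting the next. This is not ThunderRW's implementation: the function name `interleaved_walks`, the adjacency-dict graph, and the uniform transition rule are all assumptions, and the phase boundaries marked in the comments are inferred only from the names Gather, Move and Update. Python cannot show the latency-hiding effect itself; the sketch only shows the execution order that makes it possible.

```python
import random


def interleaved_walks(adj, sources, length, seed=0):
    """Run one walk per source, switching to the next walk after every step.

    adj     -- dict mapping a vertex to a list of out-neighbors (assumed format)
    sources -- start vertex of each walk query
    length  -- maximum number of steps per walk
    """
    rng = random.Random(seed)
    walks = [[s] for s in sources]
    active = list(range(len(walks))) if length > 0 else []
    while active:
        still_active = []
        for q in active:  # one step of each query, then move on to the next
            current = walks[q][-1]
            # Gather: collect the candidate edges of the current vertex
            # (uniform weights are an assumption of this sketch).
            neighbors = adj.get(current, [])
            if not neighbors:
                continue  # dead end: this walk terminates early
            # Move: sample the next vertex.
            nxt = rng.choice(neighbors)
            # Update: record the step and decide whether the walk continues.
            walks[q].append(nxt)
            if len(walks[q]) - 1 < length:
                still_active.append(q)
        active = still_active
    return walks
```

In a native implementation, the switch between queries is where a prefetch for the next query's adjacency data would be issued, so that the memory access overlaps with useful work on other walks.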
For asynchronous binary agreement (ABA) with optimal resilience, prior private-setup free protocols (Cachin et al., CCS 2002; Kokoris-Kogias et al., CCS 2020) incur $O(\lambda n^4)$ bits and $O(n^3)$ messages; for asynchronous multi-valued agreement with external validity (VBA), Abraham et al. [2] very recently gave the first elegant construction with $O(n^3)$ messages, relying on public key infrastructure (PKI), but it still costs $O(\lambda n^3 \log n)$ bits. We close the remaining efficiency gap for the first time, i.e., we reduce their communication to $O(\lambda n^3)$ bits on average. At the core of our design, we give a systematic treatment of reasonably fair common randomness: (1) We construct a reasonably fair common coin (Canetti and Rabin, STOC 1993) in the asynchronous setting with PKI instead of private setup, using only $O(\lambda n^3)$ bits and constant asynchronous rounds. The common coin protocol ensures that with probability at least 1/3, all honest parties output a common bit that is as if uniformly sampled, yielding a more efficient private-setup free ABA with expected $O(\lambda n^3)$ bits of communication and constant running time. (2) More interestingly, we lift our reasonably fair common coin protocol to attain perfect agreement without incurring any extra factor in the asymptotic complexities, resulting in an efficient reasonably fair leader election primitive pluggable into all existing VBA protocols, thus reducing the communication of private-setup free VBA to expected $O(\lambda n^3)$ bits while preserving expected constant running time. (3) Along the way, we improve an important building block, asynchronous verifiable secret sharing, by presenting a private-setup free implementation costing only $O(\lambda n^2)$ bits in the PKI setting. By contrast, prior art with the same complexity (Backes et al., CT-RSA 2013) has to rely on a private setup.
ML workloads are becoming increasingly popular in the cloud. Good cloud training performance is contingent on efficient parameter exchange among VMs. We find that collectives, the widely used distributed communication algorithms, cannot perform optimally out of the box due to the hierarchical topology of datacenter networks and the multi-tenant nature of the cloud environment. In this paper, we present Cloud Collectives, a prototype that accelerates collectives by reordering the ranks of participating VMs such that the communication pattern dictated by the selected collective operation best exploits the locality in the network. Cloud Collectives is non-intrusive, requires neither code changes nor a rebuild of an existing application, and runs without support from cloud providers. Our preliminary application of Cloud Collectives to allreduce operations in public clouds results in a speedup of up to 3.7x in multiple microbenchmarks and 1.3x in real-world workloads of distributed training of deep neural networks and gradient boosted decision trees using state-of-the-art frameworks.
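A toy example can show why rank order matters. The sketch below assumes a ring communication pattern, in which rank i sends to rank i+1, and assumes each VM's rack is already known. Both are simplifications: the abstract does not say how locality is discovered, and public clouds do not normally expose rack placement. The names `cross_rack_hops`, `reorder_by_locality`, and the VM and rack labels are hypothetical.

```python
def cross_rack_hops(rank_order, rack_of):
    """Count ring edges (rank i -> rank i+1, wrapping around) that cross racks."""
    n = len(rank_order)
    return sum(
        rack_of[rank_order[i]] != rack_of[rank_order[(i + 1) % n]]
        for i in range(n)
    )


def reorder_by_locality(vms, rack_of):
    """Give consecutive ranks to VMs on the same rack (stable sort by rack)."""
    return sorted(vms, key=lambda vm: rack_of[vm])


# Hypothetical placement: ranks alternate between two racks.
rack_of = {"vm0": "A", "vm1": "B", "vm2": "A", "vm3": "B"}
original = ["vm0", "vm1", "vm2", "vm3"]
reordered = reorder_by_locality(original, rack_of)

print(cross_rack_hops(original, rack_of))   # 4 cross-rack ring edges
print(cross_rack_hops(reordered, rack_of))  # 2 cross-rack ring edges
```

Only the mapping from rank to VM changes, which is consistent with the claim that no application code needs to be modified.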
Volumetric design is the first and critical step for professional building design, where architects not only depict the rough 3D geometry of the building but also specify the programs to form a 2D layout on each floor. Though 2D layout generation for a single story has been widely studied, there is no developed method for multi-story buildings. This paper focuses on volumetric design generation conditioned on an input program graph. Instead of outputting dense 3D voxels, we propose a new 3D representation named voxel graph that is both compact and expressive for building geometries. Our generator is a cross-modal graph neural network that uses a pointer mechanism to connect the input program graph and the output voxel graph, and the whole pipeline is trained using the adversarial framework. The generated designs are evaluated qualitatively by a user study and quantitatively using three metrics: quality, diversity, and connectivity accuracy. We show that our model generates realistic 3D volumetric designs and outperforms previous methods and baselines.
In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to transcribe the audio as well as identify the speakers for downstream applications. Since overlapped speech is common in this case, conventional approaches usually address this problem in a cascaded fashion that involves speech separation, speech recognition and speaker identification modules that are trained independently. In this paper, we propose Streaming Unmixing, Recognition and Identification Transducer (SURIT) -- a new framework that deals with this problem in an end-to-end streaming fashion. SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification. We validate our idea on the LibrispeechMix dataset -- a multi-talker dataset derived from Librispeech, and present encouraging results.
As large graph processing emerges, we observe a costly fork-processing pattern (FPP) that is common in many graph algorithms. The unique feature of the FPP is that it launches many independent queries from different source vertices on the same graph. For example, an algorithm for analyzing the network community profile can execute Personalized PageRanks that start from tens of thousands of source vertices at the same time. We study the efficiency of handling FPPs in state-of-the-art graph processing systems on multi-core architectures. We find that those systems suffer from severe cache miss penalties because of the irregular and uncoordinated memory accesses in processing FPPs. In this paper, we propose ForkGraph, a cache-efficient FPP processing system on multi-core architectures. To improve cache reuse, we divide the graph into partitions, each sized to the LLC capacity, and the queries in an FPP are buffered and executed on a per-partition basis. We further develop efficient intra- and inter-partition execution strategies. For intra-partition processing, since the graph partition fits into the LLC, we propose to execute each graph query with efficient sequential algorithms (in contrast with the parallel algorithms in existing parallel graph processing systems) and present atomic-free query processing by consolidating contending operations on the cache-resident graph partition. For inter-partition processing, we propose yielding and priority-based scheduling to reduce redundant work. Besides, we theoretically prove that ForkGraph is work efficient: it performs the same amount of work, to within a constant factor, as the fastest known sequential algorithms for processing FPP queries. Our evaluations on real-world graphs show that ForkGraph significantly outperforms state-of-the-art graph processing systems, with speedups of two orders of magnitude.
Optimistic asynchronous atomic broadcast was proposed to improve the performance of asynchronous protocols while maintaining their liveness in unstable networks (Kursawe-Shoup, 2002; Ramasamy-Cachin, 2005). These protocols use a faster deterministic protocol in the optimistic case when the network condition remains good, and can safely fall back to a pessimistic path running asynchronous atomic broadcast once the fast path fails to proceed. Unfortunately, besides the pessimistic path being slow, existing fallback mechanisms directly use a heavy tool, asynchronous multi-valued validated Byzantine agreement (MVBA). When deployed on the open Internet, where network conditions can fluctuate, the inefficient fallback may happen frequently, eliminating the benefits of adding the optimistic path. We give a generic framework for practical optimistic asynchronous atomic broadcast. We present a new abstraction of the optimistic case protocols, which can be instantiated easily. More importantly, it enables us to design a highly efficient fallback mechanism to handle fast path failures. The resulting fallback replaces the cumbersome MVBA with only a variant of simple binary agreement. Besides a detailed security analysis, we also give concrete instantiations of our framework and implement them. Extensive experiments show that our new fallback mechanism adds minimal overhead, demonstrating that our framework can enjoy both the low latency of deterministic protocols and the robust liveness of randomized asynchronous protocols in practice.
In this paper, we construct neural networks with ReLU, sine and $2^x$ as activation functions. For general continuous $f$ defined on $[0,1]^d$ with continuity modulus $\omega_f(\cdot)$, we construct ReLU-sine-$2^x$ networks that enjoy an approximation rate $\mathcal{O}\left(\omega_f(\sqrt{d})\cdot 2^{-M}+\omega_{f}\left(\frac{\sqrt{d}}{N}\right)\right)$, where $M,N\in \mathbb{N}^{+}$ denote the hyperparameters related to widths of the networks. As a consequence, we can construct a ReLU-sine-$2^x$ network with depth $5$ and width $\max\left\{\left\lceil 2d^{3/2}\left(\frac{3\mu}{\epsilon}\right)^{1/\alpha}\right\rceil, 2\left\lceil\log_2\frac{3\mu d^{\alpha/2}}{2\epsilon}\right\rceil+2\right\}$ that approximates $f\in \mathcal{H}_{\mu}^{\alpha}([0,1]^d)$ within a given tolerance $\epsilon >0$ measured in the $L^p$ norm, $p\in[1,\infty)$, where $\mathcal{H}_{\mu}^{\alpha}([0,1]^d)$ denotes the Hölder continuous function class defined on $[0,1]^d$ with order $\alpha \in (0,1]$ and constant $\mu > 0$. Therefore, the ReLU-sine-$2^x$ networks overcome the curse of dimensionality on $\mathcal{H}_{\mu}^{\alpha}([0,1]^d)$. In addition to their super expressive power, functions implemented by ReLU-sine-$2^x$ networks are (generalized) differentiable, enabling us to apply SGD to train them.
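To get a feel for the stated width bound, it can be evaluated directly. The helper below is only a transcription of the width expression in the abstract; the function name `relu_sine_exp_width` and the example parameter values are assumptions chosen for illustration.

```python
import math


def relu_sine_exp_width(d, mu, eps, alpha):
    """Evaluate the width bound from the abstract for a depth-5 network.

    d     -- input dimension
    mu    -- Hölder constant (mu > 0)
    eps   -- target tolerance in the L^p norm (eps > 0)
    alpha -- Hölder order, in (0, 1]
    """
    if not (d >= 1 and mu > 0 and eps > 0 and 0 < alpha <= 1):
        raise ValueError("expected d >= 1, mu > 0, eps > 0, 0 < alpha <= 1")
    # First term: ceil(2 * d^(3/2) * (3*mu/eps)^(1/alpha))
    first = math.ceil(2 * d ** 1.5 * (3 * mu / eps) ** (1 / alpha))
    # Second term: 2 * ceil(log2(3*mu*d^(alpha/2) / (2*eps))) + 2
    second = 2 * math.ceil(math.log2(3 * mu * d ** (alpha / 2) / (2 * eps))) + 2
    return max(first, second)


print(relu_sine_exp_width(d=1, mu=1.0, eps=0.5, alpha=1.0))   # 12
print(relu_sine_exp_width(d=4, mu=1.0, eps=0.75, alpha=1.0))  # 64
```

For fixed $\mu$, $\epsilon$ and $\alpha$, the first term grows like $d^{3/2}$, which is polynomial rather than exponential in the dimension.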
Despite their impressive performance in object recognition and other tasks under standard testing conditions, deep networks often fail to generalize to out-of-distribution (o.o.d.) samples. One cause of this shortcoming is that modern architectures tend to rely on shortcuts - superficial features that correlate with categories without capturing deeper invariants that hold across contexts. Real-world concepts often possess a complex structure that can vary superficially across contexts, so the most intuitive and promising solutions in one context may not generalize to others. One potential way to improve o.o.d. generalization is to assume simple solutions are unlikely to be valid across contexts and avoid them, which we refer to as the too-good-to-be-true prior. A low-capacity network (LCN) with a shallow architecture should only be able to learn surface relationships, including shortcuts. We find that LCNs can serve as shortcut detectors. Furthermore, an LCN's predictions can be used in a two-stage approach to encourage a high-capacity network (HCN) to rely on deeper invariant features that should generalize broadly. In particular, items that the LCN can master are downweighted when training the HCN. Using a modified version of the CIFAR-10 dataset in which we introduced shortcuts, we found that the two-stage LCN-HCN approach reduced reliance on shortcuts and facilitated o.o.d. generalization.
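The downweighting step can be sketched as follows. The abstract does not give the weighting function, so the rule used here (weight equals one minus the LCN's predicted probability for the true label) is an assumption, as are the names `hcn_sample_weights` and `weighted_mean_loss`.

```python
def hcn_sample_weights(lcn_true_label_probs):
    """Turn LCN confidence on the true label into per-item training weights.

    Assumed rule (not from the source): weight = 1 - p, so items the LCN
    already masters (p near 1) contribute little when training the HCN.
    """
    weights = []
    for p in lcn_true_label_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        weights.append(1.0 - p)
    return weights


def weighted_mean_loss(per_item_losses, weights):
    """Weighted average of the HCN's per-item losses."""
    total = sum(weights)
    if total == 0:
        return 0.0  # every item was fully mastered by the LCN
    return sum(l * w for l, w in zip(per_item_losses, weights)) / total


# Three items: mastered by the LCN, uncertain, and missed entirely.
weights = hcn_sample_weights([1.0, 0.5, 0.0])
print(weights)  # [0.0, 0.5, 1.0]
```

In a full training loop the weights would multiply the HCN's per-item losses before averaging, so gradient updates are dominated by the items the shallow network could not solve.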
Cost-efficiency and training time are primary concerns in cloud-based distributed training today. With many VM configurations to choose from, given a time constraint, what configuration achieves the lowest cost? Or, given a cost budget, which configuration leads to the highest throughput? We present a comprehensive throughput and cost-efficiency study across a wide array of instance choices in the cloud. With the insights from this study, we build Srift, a system that combines runtime instrumentation and learned performance models to accurately predict training performance and find the best choice of VMs to improve throughput and lower cost while satisfying user constraints. With PyTorch and EC2, we show Srift's choices of VM instances can lead to up to 2x better throughput and 1.6x lower cost per iteration compared to baseline choices across various DNN models in real-world scenarios, leveraging heterogeneous setups and spot instances.
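The first question, the lowest cost under a time constraint, reduces to a small search once throughput is known for each configuration. The sketch below takes those throughputs as given inputs, whereas Srift predicts them; the function name `cheapest_within_deadline`, the configuration names, and all numbers are made up for illustration and are not measurements from the paper.

```python
def cheapest_within_deadline(configs, total_iterations, deadline_hours):
    """Pick the lowest-cost configuration that finishes before the deadline.

    configs -- list of (name, iterations_per_second, dollars_per_hour)
    Returns (name, cost_dollars, hours) or None if no configuration qualifies.
    """
    best = None
    for name, iters_per_sec, dollars_per_hour in configs:
        hours = total_iterations / iters_per_sec / 3600.0
        if hours > deadline_hours:
            continue  # violates the time constraint
        cost = hours * dollars_per_hour
        if best is None or cost < best[1]:
            best = (name, cost, hours)
    return best


# Hypothetical configurations; throughput and prices are invented.
configs = [
    ("small", 1.0, 1.0),
    ("large", 4.0, 3.0),
    ("huge", 8.0, 10.0),
]
print(cheapest_within_deadline(configs, 14400, deadline_hours=2.0))
# ('large', 3.0, 1.0)
```

The second question, the highest throughput under a cost budget, follows the same pattern with the filter and the objective swapped.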
