Efficient Tensor Contraction via Fast Count Sketch

213 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Xingyu Cao

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Xingyu Cao - Jiani Liu

التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Sketching uses randomized Hash functions for dimensionality reduction and acceleration. The existing sketching methods, such as count sketch (CS), tensor sketch (TS), and higher-order count sketch (HCS), either suffer from low accuracy or slow speed in some tensor based applications. In this paper, the proposed fast count sketch (FCS) applies multiple shorter Hash functions based CS to the vector form of the input tensor, which is more accurate than TS since the spatial information of the input tensor can be preserved more sufficiently. When the input tensor admits CANDECOMP/PARAFAC decomposition (CPD), FCS can accelerate CS and HCS by using fast Fourier transform, which exhibits a computational complexity asymptotically identical to TS for low-order tensors. The effectiveness of FCS is validated by CPD, tensor regression network compression, and Kronecker product compression. Experimental results show its superior performance in terms of approximation accuracy and computational efficiency.

قيم البحث

131 - HanQin Cai , Zehan Chao , Longxiu Huang 2021

We study the problem of tensor robust principal component analysis (TRPCA), which aims to separate an underlying low-multilinear-rank tensor and a sparse outlier tensor from their sum. In this work, we propose a fast non-convex algorithm, coined Robu st Tensor CUR (RTCUR), for large-scale TRPCA problems. RTCUR considers a framework of alternating projections and utilizes the recently developed tensor Fiber CUR decomposition to dramatically lower the computational complexity. The performance advantage of RTCUR is empirically verified against the state-of-the-arts on the synthetic datasets and is further demonstrated on the real-world application such as color video background subtraction.

التعلم الآلي الرؤية الحاسوبية وتمييز الأنماط معالجة الصور والفيديو

Fast and Memory Efficient Differentially Private-SGD via JL Projections

103 - Zhiqi Bu , Sivakanth Gopi , Janardhan Kulkarni 2021

Differentially Private-SGD (DP-SGD) of Abadi et al. (2016) and its variations are the only known algorithms for private training of large scale neural networks. This algorithm requires computation of per-sample gradients norms which is extremely slow and memory intensive in practice. In this paper, we present a new framework to design differentially private optimizers called DP-SGD-JL and DP-Adam-JL. Our approach uses Johnson-Lindenstrauss (JL) projections to quickly approximate the per-sample gradient norms without exactly computing them, thus making the training time and memory requirements of our optimizers closer to that of their non-

التعلم الآلي التشفير والأمن

Jet: Fast quantum circuit simulations with parallel task-based tensor-network contraction

130 - Trevor Vincent , Lee J. ORiordan , Mikhail Andrenkov 2021

We introduce a new open-source software library Jet, which uses task-based parallelism to obtain speed-ups in classical tensor-network simulations of quantum circuits. These speed-ups result from i) the increased parallelism introduced by mapping the tensor-network simulation to a task-based framework, ii) a novel method of reusing shared work between tensor-network contraction tasks, and iii) the concurrent contraction of tensor networks on all available hardware. We demonstrate the advantages of our method by benchmarking our code on several Sycamore-53 and Gaussian boson sampling (GBS) supremacy circuits against other simulators. We also provide and compare theoretical performance estimates for tensor-network simulations of Sycamore-53 and GBS supremacy circuits for the first time.

فيزياء الكم

Parameterization of tensor network contraction

126 - Bryan OGorman 2019

We present a conceptually clear and algorithmically useful framework for parameterizing the costs of tensor network contraction. Our framework is completely general, applying to tensor networks with arbitrary bond dimensions, open legs, and hyperedge s. The fundamental objects of our framework are rooted and unrooted contraction trees, which represent classes of contraction orders. Properties of a contraction tree correspond directly and precisely to the time and space costs of tensor network contraction. The properties of rooted contraction trees give the costs of parallelized contraction algorithms. We show how contraction trees relate to existing tree-like objects in the graph theory literature, bringing to bear a wide range of graph algorithms and tools to tensor network contraction. Independent of tensor networks, we show that the edge congestion of a graph is almost equal to the branchwidth of its line graph.

بنى وهياكل البيانات والخوارزميات فيزياء الكم

Strassens Algorithm for Tensor Contraction

103 - Jianyu Huang , Devin A. Matthews , Robert A. van de Geijn 2017

Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassens algorithm for GEMM is well studied in theory and practice, ex tending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first paper to demonstrate how one can in practice speed up tensor contraction with Strassens algorithm. By adopting a Block-Scatter-Matrix format, a novel matrix-centric tensor layout, we can conceptually view TC as GEMM for a general stride storage, with an implicit tensor-to-matrix transformation. This insight enables us to tailor a recent state-of-the-art implementation of Strassens algorithm to TC, avoiding explicit transpositions (permutations) and extra workspace, and reducing the overhead of memory movement that is incurred. Performance benefits are demonstrated with a performance model as well as in practice on modern single core, multicore, and distributed memory parallel architectures, achieving up to 1.3x speedup. The resulting implementations can serve as a drop-in replacement for various applications with significant speedup.

البرمجيات الرياضية