Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Parallelized Computation and Backpropagation Under Angle-Parametrized Orthogonal Matrices

94 0 0.0 ( 0 )

Download Cite

Added by Firas Hamze

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Firas Hamze

Machine Learning Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We present a methodology for parallel acceleration of learning in the presence of matrix orthogonality and unitarity constraints of interest in several branches of machine learning. We show how an apparently sequential elementary rotation parametrization can be restructured into blocks of commutative operations using a well-known tool for coloring the edges of complete graphs, in turn widely applied to schedule round-robin (all-against-all) sports tournaments. The resulting decomposition admits an algorithm to compute a fully-parametrized orthogonal matrix from its rotation parameters in $O(n)$ sequential steps and one to compute the gradient of a training loss with respect to its parameters in $O(nlog n)$ steps. We discuss parametric restrictions of interest to generative modeling and present promising performance results with a prototype GPU implementation.

rate research

CWY Parametrization: a Solution for Parallelized Optimization of Orthogonal and Stiefel Matrices

64 - Valerii Likhosherstov , Jared Davis , Krzysztof Choromanski 2020

We introduce an efficient approach for optimization over orthogonal groups on highly parallel computation units such as GPUs or TPUs. As in earlier work, we parametrize an orthogonal matrix as a product of Householder reflections. However, to overcome low parallelization capabilities of computing Householder reflections sequentially, we propose employing an accumulation scheme called the compact WY (or CWY) transform -- a compact parallelization-friendly matrix representation for the series of Householder reflections. We further develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization which has a competitive complexity and, again, yields benefits when computed on GPUs and TPUs. We prove that our CWY and T-CWY methods lead to convergence to a stationary point of the training objective when coupled with stochastic gradient descent. We apply our methods to train recurrent neural network architectures in the tasks of neural machine translation and video prediction.

Machine Learning Machine Learning

Proximal Backpropagation

174 - Thomas Frerix , Thomas Mollenhoff , Michael Moeller 2017

We propose proximal backpropagation (ProxProp) as a novel algorithm that takes implicit instead of explicit gradient steps to update the network parameters during neural network training. Our algorithm is motivated by the step size limitation of explicit gradient descent, which poses an impediment for optimization. ProxProp is developed from a general point of view on the backpropagation algorithm, currently the most common technique to train neural networks via stochastic gradient descent and variants thereof. Specifically, we show that backpropagation of a prediction error is equivalent to sequential gradient descent steps on a quadratic penalty energy, which comprises the network activations as variables of the optimization. We further analyze theoretical properties of ProxProp and in particular prove that the algorithm yields a descent direction in parameter space and can therefore be combined with a wide variety of convergent algorithms. Finally, we devise an efficient numerical implementation that integrates well with popular deep learning frameworks. We conclude by demonstrating promising numerical results and show that ProxProp can be effectively combined with common first order optimizers such as Adam.

Machine Learning

Optimizing Pipelined Computation and Communication for Latency-Constrained Edge Learning

67 - Nicolas Skatchkovsky , Osvaldo Simeone 2019

Consider a device that is connected to an edge processor via a communication channel. The device holds local data that is to be offloaded to the edge processor so as to train a machine learning model, e.g., for regression or classification. Transmission of the data to the learning processor, as well as training based on Stochastic Gradient Descent (SGD), must be both completed within a time limit. Assuming that communication and computation can be pipelined, this letter investigates the optimal choice for the packet payload size, given the overhead of each data packet transmission and the ratio between the computation and the communication rates. This amounts to a tradeoff between bias and variance, since communicating the entire data set first reduces the bias of the training process but it may not leave sufficient time for learning. Analytical bounds on the expected optimality gap are derived so as to enable an effective optimization, which is validated in numerical results.

Machine Learning Distributed Parallel and Cluster Computing Information Theory

Parallelized Reverse Curriculum Generation

108 - Zih-Yun Chiu , Yi-Lin Tuan , Hung-yi Lee 2021

For reinforcement learning (RL), it is challenging for an agent to master a task that requires a specific series of actions due to sparse rewards. To solve this problem, reverse curriculum generation (RCG) provides a reverse expansion approach that automatically generates a curriculum for the agent to learn. More specifically, RCG adapts the initial state distribution from the neighborhood of a goal to a distance as training proceeds. However, the initial state distribution generated for each iteration might be biased, thus making the policy overfit or slowing down the reverse expansion rate. While training RCG for actor-critic (AC) based RL algorithms, this poor generalization and slow convergence might be induced by the tight coupling between an AC pair. Therefore, we propose a parallelized approach that simultaneously trains multiple AC pairs and periodically exchanges their critics. We empirically demonstrate that this proposed approach can improve RCG in performance and convergence, and it can also be applied to other AC based RL algorithms with adapted initial state distribution.

Machine Learning Robotics

Orthogonal bundles and skew-Hamiltonian matrices

434 - Roland Abuaf , Ada Boralevi 2013

Using properties of skew-Hamiltonian matrices and classic connectedness results, we prove that the moduli space $M_{ort}^0(r,n)$ of stable rank $r$ orthogonal vector bundles on $mathbb{P}^2$, with Chern classes $(c_1,c_2)=(0,n)$, and trivial splitting on the general line, is smooth irreducible of dimension $(r-2)n-{r choose 2}$ for specific values of $r$ and $n$.

Algebraic Geometry

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Parallelized Computation and Backpropagation Under Angle-Parametrized Orthogonal Matrices

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions