
Parallelized Computation and Backpropagation Under Angle-Parametrized Orthogonal Matrices

Posted by: Firas Hamze
Publication date: 2021
Research field: Informatics Engineering
Language: English
Author: Firas Hamze





We present a methodology for parallel acceleration of learning in the presence of matrix orthogonality and unitarity constraints of interest in several branches of machine learning. We show how an apparently sequential elementary rotation parametrization can be restructured into blocks of commutative operations using a well-known tool for coloring the edges of complete graphs, in turn widely applied to schedule round-robin (all-against-all) sports tournaments. The resulting decomposition admits an algorithm to compute a fully-parametrized orthogonal matrix from its rotation parameters in $O(n)$ sequential steps and one to compute the gradient of a training loss with respect to its parameters in $O(n\log n)$ steps. We discuss parametric restrictions of interest to generative modeling and present promising performance results with a prototype GPU implementation.
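To make the scheduling idea concrete, the sketch below uses the standard circle method for round-robin tournaments to split the $n(n-1)/2$ elementary (Givens) rotations into $n-1$ rounds of $n/2$ rotations that act on disjoint coordinate pairs; rotations within a round commute, so each round could be applied as one batched parallel update, leaving $O(n)$ sequential rounds. This is only a minimal NumPy sketch of the general construction, not the paper's GPU implementation, and the function names round_robin_rounds and orthogonal_from_angles are illustrative.

```python
import numpy as np

def round_robin_rounds(n):
    """Circle-method schedule: n-1 rounds, each pairing the n indices into
    n/2 disjoint pairs (the classic round-robin tournament construction)."""
    assert n % 2 == 0, "pad with a dummy index when n is odd"
    players = list(range(n))
    rounds = []
    for _ in range(n - 1):
        rounds.append([(players[i], players[n - 1 - i]) for i in range(n // 2)])
        # keep the first index fixed and rotate the rest by one position
        players = [players[0]] + [players[-1]] + players[1:-1]
    return rounds

def orthogonal_from_angles(theta, n):
    """Accumulate an n x n orthogonal matrix from n(n-1)/2 rotation angles.

    theta maps an index pair (i, j), i < j, to an angle. Rotations inside a
    round touch disjoint rows, so they commute and could be batched on a GPU;
    this sketch simply loops over them.
    """
    Q = np.eye(n)
    for pairs in round_robin_rounds(n):
        for i, j in pairs:
            a, b = min(i, j), max(i, j)
            c, s = np.cos(theta[(a, b)]), np.sin(theta[(a, b)])
            ra, rb = Q[a, :].copy(), Q[b, :].copy()
            Q[a, :] = c * ra - s * rb
            Q[b, :] = s * ra + c * rb
    return Q

# Quick check: the accumulated matrix is orthogonal.
n = 6
rng = np.random.default_rng(0)
theta = {(i, j): rng.uniform(-np.pi, np.pi) for i in range(n) for j in range(i + 1, n)}
Q = orthogonal_from_angles(theta, n)
assert np.allclose(Q @ Q.T, np.eye(n))
```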




Read also

We introduce an efficient approach for optimization over orthogonal groups on highly parallel computation units such as GPUs or TPUs. As in earlier work, we parametrize an orthogonal matrix as a product of Householder reflections. However, to overcome low parallelization capabilities of computing Householder reflections sequentially, we propose employing an accumulation scheme called the compact WY (or CWY) transform -- a compact parallelization-friendly matrix representation for the series of Householder reflections. We further develop a novel Truncated CWY (or T-CWY) approach for Stiefel manifold parametrization which has a competitive complexity and, again, yields benefits when computed on GPUs and TPUs. We prove that our CWY and T-CWY methods lead to convergence to a stationary point of the training objective when coupled with stochastic gradient descent. We apply our methods to train recurrent neural network architectures in the tasks of neural machine translation and video prediction.
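For reference, the compact WY accumulation mentioned above can be sketched as follows: a product of k Householder reflections H_i = I - tau_i v_i v_i^T is rewritten as I - V T V^T with a small upper-triangular k x k matrix T, so that applying the product reduces to a few large matrix multiplications. The code is a minimal NumPy sketch of the standard Schreiber-Van Loan recursion, not the authors' implementation or their T-CWY variant; the name compact_wy is illustrative.

```python
import numpy as np

def compact_wy(V):
    """Compact WY form of H_1 H_2 ... H_k, where H_i = I - tau_i v_i v_i^T
    with tau_i = 2 / (v_i^T v_i) and v_i the i-th column of V.

    Returns the k x k upper-triangular T such that the product equals
    I - V @ T @ V.T.
    """
    n, k = V.shape
    T = np.zeros((k, k))
    for j in range(k):
        v = V[:, j]
        tau = 2.0 / (v @ v)
        if j > 0:
            # Fold the next reflection into the existing representation.
            T[:j, j] = -tau * (T[:j, :j] @ (V[:, :j].T @ v))
        T[j, j] = tau
    return T

# Check against the explicit product of reflections.
rng = np.random.default_rng(0)
n, k = 8, 5
V = rng.standard_normal((n, k))
T = compact_wy(V)
Q = np.eye(n) - V @ T @ V.T
Q_ref = np.eye(n)
for j in range(k):
    v = V[:, j:j + 1]
    Q_ref = Q_ref @ (np.eye(n) - (2.0 / float(v.T @ v)) * (v @ v.T))
assert np.allclose(Q, Q_ref)

# Applying Q to a batch X then costs three matrix products,
# Q @ X = X - V @ (T @ (V.T @ X)), which maps well onto GPUs/TPUs.
```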
We propose proximal backpropagation (ProxProp) as a novel algorithm that takes implicit instead of explicit gradient steps to update the network parameters during neural network training. Our algorithm is motivated by the step size limitation of explicit gradient descent, which poses an impediment for optimization. ProxProp is developed from a general point of view on the backpropagation algorithm, currently the most common technique to train neural networks via stochastic gradient descent and variants thereof. Specifically, we show that backpropagation of a prediction error is equivalent to sequential gradient descent steps on a quadratic penalty energy, which comprises the network activations as variables of the optimization. We further analyze theoretical properties of ProxProp and in particular prove that the algorithm yields a descent direction in parameter space and can therefore be combined with a wide variety of convergent algorithms. Finally, we devise an efficient numerical implementation that integrates well with popular deep learning frameworks. We conclude by demonstrating promising numerical results and show that ProxProp can be effectively combined with common first order optimizers such as Adam.
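To illustrate the explicit-versus-implicit distinction on a single layer, the toy sketch below compares a plain gradient step with a proximal (implicit) step on the quadratic penalty 0.5 * ||Z - W A||_F^2; the proximal update has a closed form obtained by solving a small linear system. The symbols W, A, Z and tau are illustrative, and this is only a sketch of the implicit-step idea under these assumptions, not the full ProxProp algorithm.

```python
import numpy as np

def explicit_step(W, A, Z, tau):
    """One explicit gradient step on E(W) = 0.5 * ||Z - W @ A||_F^2."""
    grad = (W @ A - Z) @ A.T
    return W - tau * grad

def proximal_step(W, A, Z, tau):
    """Implicit (proximal) step: argmin_V E(V) + ||V - W||_F^2 / (2 * tau).

    Setting the gradient to zero gives V (A A^T + I / tau) = Z A^T + W / tau,
    a linear system that is solved exactly instead of taking a small step.
    """
    d_in = A.shape[0]
    M = A @ A.T + np.eye(d_in) / tau
    rhs = Z @ A.T + W / tau
    return np.linalg.solve(M.T, rhs.T).T

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 32))   # layer inputs (d_in x batch)
Z = rng.standard_normal((3, 32))   # target pre-activations (d_out x batch)
W = rng.standard_normal((3, 4))
# For large tau the explicit step can overshoot, while the proximal step
# solves its subproblem exactly regardless of the step size.
W_exp = explicit_step(W, A, Z, tau=1.0)
W_prox = proximal_step(W, A, Z, tau=1.0)
```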
Consider a device that is connected to an edge processor via a communication channel. The device holds local data that is to be offloaded to the edge processor so as to train a machine learning model, e.g., for regression or classification. Transmission of the data to the learning processor, as well as training based on Stochastic Gradient Descent (SGD), must be both completed within a time limit. Assuming that communication and computation can be pipelined, this letter investigates the optimal choice for the packet payload size, given the overhead of each data packet transmission and the ratio between the computation and the communication rates. This amounts to a tradeoff between bias and variance, since communicating the entire data set first reduces the bias of the training process but it may not leave sufficient time for learning. Analytical bounds on the expected optimality gap are derived so as to enable an effective optimization, which is validated in numerical results.
For reinforcement learning (RL), it is challenging for an agent to master a task that requires a specific series of actions due to sparse rewards. To solve this problem, reverse curriculum generation (RCG) provides a reverse expansion approach that automatically generates a curriculum for the agent to learn. More specifically, RCG adapts the initial state distribution from the neighborhood of a goal to a distance as training proceeds. However, the initial state distribution generated for each iteration might be biased, thus making the policy overfit or slowing down the reverse expansion rate. While training RCG for actor-critic (AC) based RL algorithms, this poor generalization and slow convergence might be induced by the tight coupling between an AC pair. Therefore, we propose a parallelized approach that simultaneously trains multiple AC pairs and periodically exchanges their critics. We empirically demonstrate that this proposed approach can improve RCG in performance and convergence, and it can also be applied to other AC based RL algorithms with adapted initial state distribution.
Roland Abuaf, Ada Boralevi (2013)
Using properties of skew-Hamiltonian matrices and classic connectedness results, we prove that the moduli space $M_{ort}^0(r,n)$ of stable rank $r$ orthogonal vector bundles on $\mathbb{P}^2$, with Chern classes $(c_1,c_2)=(0,n)$, and trivial splitting on the general line, is smooth irreducible of dimension $(r-2)n-\binom{r}{2}$ for specific values of $r$ and $n$.
