Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Parallel computation of echelon forms

515 0 0.0 ( 0 )

Download Cite

Added by Jean-Guillaume Dumas

Publication date 2014

fields Informatics Engineering

and research's language is English

Authors Jean-Guillaume Dumas

Symbolic Computation Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We propose efficient parallel algorithms and implementations on shared memory architectures of LU factorization over a finite field. Compared to the corresponding numerical routines, we have identified three main difficulties specific to linear algebra over finite fields. First, the arithmetic complexity could be dominated by modular reductions. Therefore, it is mandatory to delay as much as possible these reductions while mixing fine-grain parallelizations of tiled iterative and recursive algorithms. Second, fast linear algebra variants, e.g., using Strassen-Winograd algorithm, never suffer from instability and can thus be widely used in cascade with the classical algorithms. There, trade-offs are to be made between size of blocks well suited to those fast variants or to load and communication balancing. Third, many applications over finite fields require the rank profile of the matrix (quite often rank deficient) rather than the solution to a linear system. It is thus important to design parallel algorithms that preserve and compute this rank profile. Moreover, as the rank profile is only discovered during the algorithm, block size has then to be dynamic. We propose and compare several block decomposition: tile iterative with left-looking, right-looking and Crout variants, slab and tile recursive. Experiments demonstrate that the tile recursive variant performs better and matches the performance of reference numerical software when no rank deficiency occur. Furthermore, even in the most heterogeneous case, namely when all pivot blocks are rank deficient, we show that it is possbile to maintain a high efficiency.

rate research

Parallel sparse interpolation using small primes

468 - Mohamed Khochtali , Daniel S. Roche , Xisen Tian 2015

To interpolate a supersparse polynomial with integer coefficients, two alternative approaches are the Prony-based big prime technique, which acts over a single large finite field, or the more recently-proposed small primes technique, which reduces the unknown sparse polynomial to many low-degree dense polynomials. While the latter technique has not yet reached the same theoretical efficiency as Prony-based methods, it has an obvious potential for parallelization. We present a heuristic small primes interpolation algorithm and report on a low-level C implementation using FLINT and MPI.

Symbolic Computation Distributed Parallel and Cluster Computing

ParFORM: Parallel Version of the Symbolic Manipulation Program FORM

120 - M.Tentyukov , D.Fliegner , M.Frank 2004

After an introduction to the sequential version of FORM and the mechanisms behind, we report on the status of our project of parallelization. We have now a parallel version of FORM running on Cluster- and SMP-architectures. This version can be used to run arbitrary FORM programs in parallel.

Symbolic Computation Distributed Parallel and Cluster Computing High Energy Physics - Phenomenology

Parallel Integer Polynomial Multiplication

319 - Changbo Chen , Svyatoslav Covanov , Farnam Mansouri 2016

We propose a new algorithm for multiplying dense polynomials with integer coefficients in a parallel fashion, targeting multi-core processor architectures. Complexity estimates and experimental comparisons demonstrate the advantages of this new approach.

Symbolic Computation Mathematical Software

Parallel Restarted SPIDER -- Communication Efficient Distributed Nonconvex Optimization with Optimal Computation Complexity

124 - Pranay Sharma , Swatantra Kafle , Prashant Khanduri 2019

In this paper, we propose a distributed algorithm for stochastic smooth, non-convex optimization. We assume a worker-server architecture where $N$ nodes, each having $n$ (potentially infinite) number of samples, collaborate with the help of a central server to perform the optimization task. The global objective is to minimize the average of local cost functions available at individual nodes. The proposed approach is a non-trivial extension of the popular parallel-restarted SGD algorithm, incorporating the optimal variance-reduction based SPIDER gradient estimator into it. We prove convergence of our algorithm to a first-order stationary solution. The proposed approach achieves the best known communication complexity $O(epsilon^{-1})$ along with the optimal computation complexity. For finite-sum problems (finite $n$), we achieve the optimal computation (IFO) complexity $O(sqrt{Nn}epsilon^{-1})$. For online problems ($n$ unknown or infinite), we achieve the optimal IFO complexity $O(epsilon^{-3/2})$. In both the cases, we maintain the linear speedup achieved by existing methods. This is a massive improvement over the $O(epsilon^{-2})$ IFO complexity of the existing approaches. Additionally, our algorithm is general enough to allow non-identical distributions of data across workers, as in the recently proposed federated learning paradigm.

Optimization and Control Distributed Parallel and Cluster Computing Machine Learning

pLUTo: In-DRAM Lookup Tables to Enable Massively Parallel General-Purpose Computation

78 - Jo~ao Dinis Ferreira , Gabriel Falcao , Juan Gomez-Luna 2021

Data movement between main memory and the processor is a significant contributor to the execution time and energy consumption of memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM), which enables computation inside the memory chip. However, existing PiM architectures often lack support for complex operations, since supporting these operations increases design complexity, chip area, and power consumption. We introduce pLUTo (processing-in-memory with lookup table [LUT] operations), a new DRAM substrate that leverages the high area density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The use of LUTs enables the efficient execution of complex operations in-memory, which has been a long-standing challenge in the domain of PiM. When running a state-of-the-art binary neural network in a single DRAM subarray, pLUTo outperforms the baseline CPU and GPU implementations by $33times$ and $8times$, respectively, while simultaneously achieving energy savings of $110times$ and $80times$.

Hardware Architecture Distributed Parallel and Cluster Computing

comments

Fetching comments

Sohag University

Additional details More universities

Parallel computation of echelon forms

Ask ChatGPT about the research

No Arabic abstract

Read More