ﻻ يوجد ملخص باللغة العربية
We propose efficient parallel algorithms and implementations on shared memory architectures of LU factorization over a finite field. Compared to the corresponding numerical routines, we have identified three main difficulties specific to linear algebra over finite fields. First, the arithmetic complexity could be dominated by modular reductions. Therefore, it is mandatory to delay as much as possible these reductions while mixing fine-grain parallelizations of tiled iterative and recursive algorithms. Second, fast linear algebra variants, e.g., using Strassen-Winograd algorithm, never suffer from instability and can thus be widely used in cascade with the classical algorithms. There, trade-offs are to be made between size of blocks well suited to those fast variants or to load and communication balancing. Third, many applications over finite fields require the rank profile of the matrix (quite often rank deficient) rather than the solution to a linear system. It is thus important to design parallel algorithms that preserve and compute this rank profile. Moreover, as the rank profile is only discovered during the algorithm, block size has then to be dynamic. We propose and compare several block decomposition: tile iterative with left-looking, right-looking and Crout variants, slab and tile recursive. Experiments demonstrate that the tile recursive variant performs better and matches the performance of reference numerical software when no rank deficiency occur. Furthermore, even in the most heterogeneous case, namely when all pivot blocks are rank deficient, we show that it is possbile to maintain a high efficiency.
To interpolate a supersparse polynomial with integer coefficients, two alternative approaches are the Prony-based big prime technique, which acts over a single large finite field, or the more recently-proposed small primes technique, which reduces th
After an introduction to the sequential version of FORM and the mechanisms behind, we report on the status of our project of parallelization. We have now a parallel version of FORM running on Cluster- and SMP-architectures. This version can be used to run arbitrary FORM programs in parallel.
We propose a new algorithm for multiplying dense polynomials with integer coefficients in a parallel fashion, targeting multi-core processor architectures. Complexity estimates and experimental comparisons demonstrate the advantages of this new approach.
In this paper, we propose a distributed algorithm for stochastic smooth, non-convex optimization. We assume a worker-server architecture where $N$ nodes, each having $n$ (potentially infinite) number of samples, collaborate with the help of a central
Data movement between main memory and the processor is a significant contributor to the execution time and energy consumption of memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM), which en