No Arabic abstract
We describe an efficient implementation of a hierarchy of algorithms for multiplication of dense matrices over the field with two elements (GF(2)). In particular we present our implementation -- in the M4RI library -- of Strassen-Winograd matrix multiplication and the Method of the Four Russians multiplication (M4RM) and compare it against other available implementations. Good performance is demonstrated on on AMDs Opteron and particulary good performance on Intels Core 2 Duo. The open-source M4RI library is available stand-alone as well as part of the Sage mathematics software. In machine terms, addition in GF(2) is logical-XOR, and multiplication is logical-AND, thus a machine word of 64-bits allows one to operate on 64 elements of GF(2) in parallel: at most one CPU cycle for 64 parallel additions or multiplications. As such, element-wise operations over GF(2) are relatively cheap. In fact, in this paper, we conclude that the actual bottlenecks are memory reads and writes and issues of data locality. We present our empirical findings in relation to minimizing these and give an analysis thereof.
In this work we describe an efficient implementation of a hierarchy of algorithms for the decomposition of dense matrices over the field with two elements (GF(2)). Matrix decomposition is an essential building block for solving dense systems of linear and non-linear equations and thus much research has been devoted to improve the asymptotic complexity of such algorithms. In this work we discuss an implementation of both well-known and improved algorithms in the M4RI library. The focus of our discussion is on a new variant of the M4RI algorithm - denoted MMPF in this work -- which allows for considerable performance gains in practice when compared to the previously fastest implementation. We provide performance figures on x86_64 CPUs to demonstrate the viability of our approach.
We propose several new schedules for Strassen-Winograds matrix multiplication algorithm, they reduce the extra memory allocation requirements by three different means: by introducing a few pre-additions, by overwriting the input matrices, or by using a first recursive level of classical multiplication. In particular, we show two fully in-place schedules: one having the same number of operations, if the input matrices can be overwritten; the other one, slightly increasing the constant of the leading term of the complexity, if the input matrices are read-only. Many of these schedules have been found by an implementation of an exhaustive search algorithm based on a pebble game.
We introduce a consistent and efficient method to construct self-dual codes over $GF(q)$ with symmetric generator matrices from a self-dual code over $GF(q)$ of smaller length where $q equiv 1 pmod 4$. Using this method, we improve the best-known minimum weights of self-dual codes, which have not significantly improved for almost two decades. We focus on a class of self-dual codes, including double circulant codes. Using our method, called a `symmetric building-up construction, we obtain many new self-dual codes over $GF(13)$ and $GF(17)$ and improve the bounds of best-known minimum weights of self-dual codes of lengths up to 40. Besides, we compute the minimum weights of quadratic residue codes that were not known before. These are: a [20,10,10] QR self-dual code over $GF(23)$, two [24,12,12] QR self-dual codes over $GF(29)$ and $GF(41)$, and a [32,12,14] QR self-dual codes over $GF(19)$. They have the highest minimum weights so far.
Four recursive constructions of permutation polynomials over $gf(q^2)$ with those over $gf(q)$ are developed and applied to a few famous classes of permutation polynomials. They produce infinitely many new permutation polynomials over $gf(q^{2^ell})$ for any positive integer $ell$ with any given permutation polynomial over $gf(q)$. A generic construction of permutation polynomials over $gf(2^{2m})$ with o-polynomials over $gf(2^m)$ is also presented, and a number of new classes of permutation polynomials over $gf(2^{2m})$ are obtained.
Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the single format can support an SpMV algorithm that is efficient both for regular matrices and for irregular matrices. Furthermore, we show that the overhead of the format conversion from the CSR to the CSR5 can be as low as the cost of a few SpMV operations. We compare the CSR5-based SpMV algorithm with 11 state-of-the-art formats and algorithms on four mainstream processors using 14 regular and 10 irregular matrices as a benchmark suite. For the 14 regular matrices in the suite, we achieve comparable or better performance over the previous work. For the 10 irregular matrices, the CSR5 obtains average performance improvement of 17.6%, 28.5%, 173.0% and 293.3% (up to 213.3%, 153.6%, 405.1% and 943.3%) over the best existing work on dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi, respectively. For real-world applications such as a solver with only tens of iterations, the CSR5 format can be more practical because of its low-overhead for format conversion. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR5