Counting the number of ones in a binary stream is a common operation in database, information-retrieval, cryptographic, and machine-learning applications. Most processors have dedicated instructions to count the number of ones in a word (e.g., popcnt on x64 processors). Perhaps surprisingly, we show that a vectorized approach using SIMD instructions can be twice as fast as using the dedicated instructions on recent Intel processors. The benefits can be even greater for applications such as similarity measures (e.g., the Jaccard index) that require additional Boolean operations. Our approach has been adopted by LLVM: it is used by its popular C compiler, clang.
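As a rough sketch of the kind of vectorized kernel referred to here (not necessarily the exact routine adopted by LLVM; the function name and loop structure are illustrative), an AVX2 implementation can count bits with a nibble lookup table and per-lane byte sums:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative AVX2 population count over 32-byte blocks using a nibble
 * lookup table (vpshufb) and horizontal byte sums (vpsadbw). */
static uint64_t avx2_popcount(const uint8_t *data, size_t nblocks) {
    const __m256i lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < nblocks; i++) {
        __m256i v   = _mm256_loadu_si256((const __m256i *)(data + 32 * i));
        __m256i lo  = _mm256_and_si256(v, low_mask);
        __m256i hi  = _mm256_and_si256(_mm256_srli_epi32(v, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                      _mm256_shuffle_epi8(lut, hi));
        /* fold the 32 per-byte counts into four 64-bit partial sums */
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
    }
    uint64_t out[4];
    _mm256_storeu_si256((__m256i *)out, acc);
    return out[0] + out[1] + out[2] + out[3];
}
```

The published kernels are more elaborate (for instance, combining such lookups with carry-save accumulation), but the lookup-plus-horizontal-sum structure above is the basic building block.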
In several fields such as statistics, machine learning, and bioinformatics, categorical variables are frequently represented as one-hot encoded vectors. For example, given 8 distinct values, we map each value to a byte where only a single bit has been set. We are motivated to quickly compute statistics over such encodings. Given a stream of k-bit words, we seek to compute k distinct sums corresponding to bit values at indexes 0, 1, 2, ..., k-1. If the k-bit words are one-hot encoded, then the sums correspond to a frequency histogram. This multiple-sum problem is a generalization of the population-count problem where we seek the sum of all bit values. Accordingly, we refer to the multiple-sum problem as a positional population count. Using SIMD (Single Instruction, Multiple Data) instructions from recent Intel processors, we describe algorithms for computing the 16-bit positional population count using less than half a CPU cycle per 16-bit word. Our best approach uses up to 400 times fewer instructions and is up to 50 times faster than baseline code using only regular (non-SIMD) instructions, for sufficiently large inputs.
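For concreteness, a straightforward scalar baseline for the 16-bit positional population count (the kind of code the SIMD kernels are compared against, though not necessarily the exact baseline used in the paper) looks like this:

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar positional population count: counts[j] accumulates the number of
 * 16-bit words in the stream whose bit j is set. */
void pospopcnt16_scalar(const uint16_t *words, size_t n, uint64_t counts[16]) {
    for (size_t i = 0; i < n; i++) {
        uint16_t w = words[i];
        for (int j = 0; j < 16; j++) {
            counts[j] += (w >> j) & 1;
        }
    }
}
```

If each word is one-hot encoded, the resulting counts array is exactly the frequency histogram mentioned above.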
This work describes the SIMD vectorization of the force calculation of the Lennard-Jones potential with Intel AVX2 and AVX-512 instruction sets. Since the force-calculation kernel of the molecular dynamics method involves indirect access to memory, the data layout is one of the most important factors in vectorization. We find that the Array of Structures (AoS) layout with padding exhibits better performance than the Structure of Arrays (SoA) layout when appropriate vectorization and optimizations are applied. In particular, AoS with 512-bit width exhibits the best performance among the architectures considered. While the difference in performance between AoS and SoA is significant for vectorization with AVX2, it is minor with AVX-512. The effect of other optimization techniques, such as software pipelining together with vectorization, is also discussed. We present results for benchmarks on three CPU architectures: Intel Haswell (HSW), Knights Landing (KNL), and Skylake (SKL). The performance gain from vectorization is about 42% on HSW compared with the code optimized without vectorization. On KNL, the hand-vectorized codes exhibit 34% better performance than the codes vectorized automatically by the Intel compiler. On SKL, the code vectorized with AVX2 exhibits slightly better performance than that vectorized with AVX-512.
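The two data layouts under comparison can be sketched as follows (field names and the padding width are illustrative, not the paper's exact definitions):

```c
/* Array of Structures (AoS) with padding: each particle occupies a full
 * 256-bit (4 x double) slot, so one vector load fetches one whole particle. */
typedef struct {
    double x, y, z;
    double pad;               /* padding to 32 bytes */
} particle_aos;               /* used as: particle_aos q[N]; */

/* Structure of Arrays (SoA): each coordinate is stored contiguously, so one
 * vector load fetches the same coordinate of several consecutive particles. */
typedef struct {
    double *x;                /* length N */
    double *y;                /* length N */
    double *z;                /* length N */
} particles_soa;
```

Because the kernel reaches neighbor particles through an index list (indirect access), the padded AoS layout lets each neighbor be fetched with a single vector load rather than a gather per coordinate, which is consistent with the abstract's observation that data layout is decisive here.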
Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set in their x64 processors. We use CLMUL to implement an almost universal 64-bit hash family (CLHASH). We compare this new family with what might be the fastest almost universal family on x64 processors (VHASH). We find that CLHASH is at least 60% faster. We also compare CLHASH with a popular hash function designed for speed (Google's CityHash). We find that CLHASH is 40% faster than CityHash on inputs larger than 64 bytes and just as fast otherwise.
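As a minimal illustration of the underlying primitive (this shows only the carry-less multiplication instruction, not the CLHASH construction itself):

```c
#include <immintrin.h>   /* PCLMULQDQ intrinsics; compile with -mpclmul */
#include <stdint.h>

/* Carry-less (GF(2)[x]) product of two 64-bit operands; the full 128-bit
 * polynomial product is returned in an SSE register. */
static __m128i clmul64(uint64_t a, uint64_t b) {
    __m128i va = _mm_set_epi64x(0, (long long)a);
    __m128i vb = _mm_set_epi64x(0, (long long)b);
    return _mm_clmulepi64_si128(va, vb, 0x00);  /* multiply the low halves */
}
```

Roughly speaking, CLHASH replaces the integer multiplications at the core of VHASH-style hashing with such carry-less products; see the paper for the exact construction and its universality analysis.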
Motivated by recent Linear Programming solvers, we design dynamic data structures for maintaining the inverse of an $n \times n$ real matrix under \textit{low-rank} updates, with polynomially faster amortized running time. Our data structure is based on a recursive application of the Woodbury-Morrison identity for implementing \textit{cascading} low-rank updates, combined with recent sketching technology. Our techniques and amortized analysis of multi-level partial updates may be of broader interest to dynamic matrix problems. This data structure leads to the fastest known LP solver for general (dense) linear programs, improving the running time of the recent algorithms of (Cohen et al. '19, Lee et al. '19, Brand '20) from $O^*(n^{2+\max\{\frac{1}{6},\, \omega-2,\, \frac{1-\alpha}{2}\}})$ to $O^*(n^{2+\max\{\frac{1}{18},\, \omega-2,\, \frac{1-\alpha}{2}\}})$, where $\omega$ and $\alpha$ are the fast matrix multiplication exponent and its dual. Hence, under the common belief that $\omega \approx 2$ and $\alpha \approx 1$, our LP solver runs in $O^*(n^{2.055})$ time instead of $O^*(n^{2.16})$.
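For reference, the standard Woodbury (Sherman-Morrison-Woodbury) identity applied recursively here states that, for an invertible $n \times n$ matrix $M$, an $n \times k$ matrix $U$, an invertible $k \times k$ matrix $C$, and a $k \times n$ matrix $V$ (whenever the inverses involved exist),
\[
(M + UCV)^{-1} \;=\; M^{-1} \;-\; M^{-1} U \left(C^{-1} + V M^{-1} U\right)^{-1} V M^{-1},
\]
so a rank-$k$ correction to $M$ can be folded into an already known $M^{-1}$ at a cost governed by $k$, rather than recomputing an $n \times n$ inverse from scratch.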
Problems of the following kind have been the focus of much recent research in the realm of parameterized complexity: Given an input graph (digraph) on $n$ vertices and a positive integer parameter $k$, find if there exist $k$ edges (arcs) whose deletion results in a graph that satisfies some specified parity constraints. In particular, when the objective is to obtain a connected graph in which all the vertices have even degrees (that is, the resulting graph is \emph{Eulerian}), the problem is called Undirected Eulerian Edge Deletion. The corresponding problem in digraphs, where the resulting graph should be strongly connected and every vertex should have the same in-degree as its out-degree, is called Directed Eulerian Edge Deletion. Cygan et al. [\emph{Algorithmica, 2014}] showed that these problems are fixed-parameter tractable (FPT), and gave algorithms with running time $2^{O(k \log k)}n^{O(1)}$. They also asked, as an open problem, whether there exist FPT algorithms which solve these problems in time $2^{O(k)}n^{O(1)}$. In this paper we answer their question in the affirmative: using the technique of computing \emph{representative families of co-graphic matroids}, we design algorithms which solve these problems in time $2^{O(k)}n^{O(1)}$. The crucial insight we bring to these problems is to view the solution as an independent set of a co-graphic matroid. We believe that this viewpoint/approach will be useful in other problems where one of the constraints that needs to be satisfied is that of connectivity.