
Efficient Computation of Positional Population Counts Using SIMD Instructions

Posted by Daniel Lemire
Publication date: 2019
Research field: Informatics Engineering
Paper language: English





In several fields such as statistics, machine learning, and bioinformatics, categorical variables are frequently represented as one-hot encoded vectors. For example, given 8 distinct values, we map each value to a byte where only a single bit has been set. We are motivated to quickly compute statistics over such encodings. Given a stream of k-bit words, we seek to compute k distinct sums corresponding to bit values at indexes 0, 1, 2, ..., k-1. If the k-bit words are one-hot encoded then the sums correspond to a frequency histogram. This multiple-sum problem is a generalization of the population-count problem where we seek the sum of all bit values. Accordingly, we refer to the multiple-sum problem as a positional population-count. Using SIMD (Single Instruction, Multiple Data) instructions from recent Intel processors, we describe algorithms for computing the 16-bit positional population count using less than half of a CPU cycle per 16-bit word. Our best approach uses up to 400 times fewer instructions and is up to 50 times faster than baseline code using only regular (non-SIMD) instructions, for sufficiently large inputs.
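
As a point of reference, the following is a minimal scalar sketch of the positional population count described above (the function name pospopcnt16_scalar is illustrative, not from the paper): every 16-bit word contributes its bit i to counter i, so a one-hot encoded stream yields a frequency histogram. The paper's contribution is replacing this per-bit loop with SIMD algorithms that are up to 50 times faster.

    #include <stdint.h>
    #include <stddef.h>

    /* Scalar baseline sketch (illustrative, not the paper's SIMD code):
     * counts[i] ends up holding how many of the n words have bit i set.
     * If the words are one-hot encoded, counts[] is a frequency histogram. */
    void pospopcnt16_scalar(const uint16_t *data, size_t n, uint32_t counts[16]) {
        for (int i = 0; i < 16; i++)
            counts[i] = 0;
        for (size_t j = 0; j < n; j++)
            for (int i = 0; i < 16; i++)
                counts[i] += (data[j] >> i) & 1u;
    }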




Read also

Counting the number of ones in a binary stream is a common operation in database, information-retrieval, cryptographic and machine-learning applications. Most processors have dedicated instructions to count the number of ones in a word (e.g., popcnt on x64 processors). Maybe surprisingly, we show that a vectorized approach using SIMD instructions can be twice as fast as using the dedicated instructions on recent Intel processors. The benefits can be even greater for applications such as similarity measures (e.g., the Jaccard index) that require additional Boolean operations. Our approach has been adopted by LLVM: it is used by its popular C compiler (clang).
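
The vectorized approach referred to above is, in production, a Harley-Seal carry-save accumulator built on byte-wise lookups; the sketch below shows only the lookup step with AVX2 intrinsics (the function name popcount_avx2 is illustrative), assuming an x64 target with AVX2 and GCC/Clang builtins, so the contrast with a plain per-word popcnt loop is visible.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of the lookup step behind vectorized popcounts (illustrative):
     * each byte is split into two nibbles and vpshufb is used as a 16-entry
     * popcount table. The production code adopted by LLVM layers a
     * Harley-Seal carry-save accumulator on top of this idea. */
    uint64_t popcount_avx2(const uint8_t *data, size_t n) {
        const __m256i table = _mm256_setr_epi8(
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
        const __m256i low_mask = _mm256_set1_epi8(0x0f);
        __m256i acc = _mm256_setzero_si256();
        size_t i = 0;
        for (; i + 32 <= n; i += 32) {
            __m256i v  = _mm256_loadu_si256((const __m256i *)(data + i));
            __m256i lo = _mm256_and_si256(v, low_mask);
            __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
            __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(table, lo),
                                          _mm256_shuffle_epi8(table, hi));
            /* sum the 32 byte counts into four 64-bit lanes */
            acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
        }
        uint64_t tmp[4], total;
        _mm256_storeu_si256((__m256i *)tmp, acc);
        total = tmp[0] + tmp[1] + tmp[2] + tmp[3];
        for (; i < n; i++)                      /* scalar tail */
            total += (uint64_t)__builtin_popcount(data[i]);
        return total;
    }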
A visibility algorithm maps time series into complex networks following a simple criterion. The resulting visibility graph has recently proven to be a powerful tool for time series analysis. However, its straightforward computation is time-consuming and rigid, motivating the development of more efficient algorithms. Here we present a highly efficient method to compute visibility graphs with the further benefit of flexibility: on-line computation. We propose an encoder/decoder approach, with an on-line adjustable binary search tree codec for time series as well as its corresponding decoder for visibility graphs. The empirical evidence suggests the proposed method for computation of visibility graphs offers an on-line computation solution at no additional computation time cost. The source code is available online.
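
For concreteness, the following is the straightforward natural-visibility check that the abstract above calls time-consuming; the paper's encoder/decoder replaces this brute-force scan. Function names are illustrative, and the series is assumed to be uniformly sampled so the index plays the role of time.

    #include <stddef.h>
    #include <string.h>

    /* Naive natural-visibility criterion (illustrative): samples a and b are
     * linked iff every intermediate sample lies strictly below the straight
     * line joining (a, y[a]) and (b, y[b]). */
    static int visible(const double *y, size_t a, size_t b) {
        for (size_t c = a + 1; c < b; c++) {
            double line = y[b] + (y[a] - y[b]) * (double)(b - c) / (double)(b - a);
            if (y[c] >= line)
                return 0;
        }
        return 1;
    }

    /* Fills an n-by-n adjacency matrix; adjacent samples are always linked. */
    void visibility_graph_naive(const double *y, size_t n, unsigned char *adj) {
        memset(adj, 0, n * n);
        for (size_t i = 0; i < n; i++)
            for (size_t j = i + 1; j < n; j++)
                adj[i * n + j] = adj[j * n + i] = (unsigned char)visible(y, i, j);
    }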
Network reliability is an important metric to evaluate the connectivity among given vertices in uncertain graphs. Since the network reliability problem is known to be #P-complete, existing studies have used approximation techniques. In this paper, we propose a new sampling-based approach that efficiently and accurately approximates network reliability. Our approach improves efficiency by reducing the number of samples based on stratified sampling. We theoretically guarantee that our approach improves the accuracy of approximation by using lower and upper bounds of network reliability, even though it reduces the number of samples. To efficiently compute the bounds, we develop an extended BDD, called S2BDD. While constructing the S2BDD, our approach employs dynamic programming for efficiently sampling possible graphs. Our experiment with real datasets demonstrates that our approach is up to 51.2 times faster than the existing sampling-based approach with higher accuracy.
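
As background for the kind of sampling baseline being improved on, a plain (unstratified) Monte-Carlo estimator for two-terminal network reliability looks roughly like the sketch below; names and the use of rand() are illustrative and not from the paper, whose stratified, S2BDD-guided sampler needs far fewer samples for the same accuracy.

    #include <stdlib.h>

    /* Plain Monte-Carlo baseline (illustrative): edge e exists independently
     * with probability p[e]; estimate the probability that s and t remain
     * connected, by sampling whole graphs and checking connectivity. */
    static int find(int *parent, int x) {          /* union-find, path halving */
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    double reliability_mc(int n_nodes, int n_edges,
                          const int *u, const int *v, const double *p,
                          int s, int t, int samples) {
        int hits = 0;
        int *parent = malloc((size_t)n_nodes * sizeof *parent);
        for (int k = 0; k < samples; k++) {
            for (int i = 0; i < n_nodes; i++) parent[i] = i;
            for (int e = 0; e < n_edges; e++) {
                if ((double)rand() / RAND_MAX < p[e]) {   /* edge survives */
                    int a = find(parent, u[e]), b = find(parent, v[e]);
                    if (a != b) parent[a] = b;
                }
            }
            if (find(parent, s) == find(parent, t)) hits++;
        }
        free(parent);
        return (double)hits / samples;
    }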
This work describes the SIMD vectorization of the force calculation of the Lennard-Jones potential with Intel AVX2 and AVX-512 instruction sets. Since the force-calculation kernel of the molecular dynamics method involves indirect access to memory, the data layout is one of the most important factors in vectorization. We find that the Array of Structures (AoS) layout with padding exhibits better performance than the Structure of Arrays (SoA) layout with appropriate vectorization and optimizations. In particular, AoS with 512-bit width exhibits the best performance among the architectures. While the difference in performance between AoS and SoA is significant for the vectorization with AVX2, it is minor with AVX-512. The effect of other optimization techniques, such as software pipelining together with vectorization, is also discussed. We present results for benchmarks on three CPU architectures: Intel Haswell (HSW), Knights Landing (KNL), and Skylake (SKL). The performance gains from vectorization are about 42% on HSW compared with the code optimized without vectorization. On KNL, the hand-vectorized codes exhibit 34% better performance than the codes vectorized automatically by the Intel compiler. On SKL, the code vectorized with AVX2 exhibits slightly better performance than that vectorized with AVX-512.
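
To make the layout comparison above concrete, here is a sketch of the two candidate layouts for particle coordinates (type and array names are illustrative): AoS with one padding element keeps each particle in a 32-byte block, i.e. one full 256-bit AVX2 register or half of a 512-bit register, which suits the indirect, neighbor-list-driven access of the force kernel, while SoA keeps each coordinate component contiguous so a vector load fetches the same component of consecutive particles.

    #define N_PARTICLES 4096        /* illustrative size */

    /* Array of Structures with padding: x, y, z plus an unused pad element,
     * so each particle occupies exactly 32 bytes. */
    typedef struct {
        double x, y, z, pad;
    } particle_aos_t;

    /* Structure of Arrays: each coordinate component stored contiguously. */
    typedef struct {
        double x[N_PARTICLES];
        double y[N_PARTICLES];
        double z[N_PARTICLES];
    } particle_soa_t;

    static particle_aos_t q_aos[N_PARTICLES];
    static particle_soa_t q_soa;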
We present a new algorithm for computing balanced flows in equality networks arising in market equilibrium computations. The current best time bound for computing balanced flows in such networks requires $O(n)$ maxflow computations, where $n$ is the number of nodes in the network [Devanur et al. 2008]. Our algorithm requires only a single parametric flow computation. The best algorithm for computing parametric flows [Gallo et al. 1989] is only a logarithmic factor slower than the best algorithms for computing maxflows. Hence, the running times of the algorithms in [Devanur et al. 2008] and [Duan and Mehlhorn 2015] for computing market equilibria in linear Fisher and Arrow-Debreu markets improve by almost a factor of $n$.