Do you want to publish a course? Click here

Optimized Password Recovery for Encrypted RAR on GPUs

99   0   0.0 ( 0 )
 Added by Xiaojing An
 Publication date 2015
and research's language is English




Ask ChatGPT about the research

RAR uses classic symmetric encryption algorithm SHA-1 hashing and AES algorithm for encryption, and the only method of password recovery is brute force, which is very time-consuming. In this paper, we present an approach using GPUs to speed up the password recovery process. However, because the major calculation and time-consuming part, SHA-1 hashing, is hard to be parallelized, so this paper adopts coarse granularity parallel. That is, one GPU thread is responsible for the validation of one password. We mainly use three optimization methods to optimize this parallel version: asynchronous parallel between CPU and GPU, redundant calculations and conditional statements reduction, and the usage of registers optimization. Experiment result shows that the final version reaches 43~57 times speedup on an AMD FirePro W8000 GPU, compared to a well-optimized serial version on Intel Core i5 CPU.



rate research

Read More

90 - Yanhao Chen 2019
Priority queue, often implemented as a heap, is an abstract data type that has been used in many well-known applications like Dijkstras shortest path algorithm, Prims minimum spanning tree, Huffman encoding, and the branch-and-bound algorithm. However, it is challenging to exploit the parallelism of the heap on GPUs since the control divergence and memory irregularity must be taken into account. In this paper, we present a parallel generalized heap model that works effectively on GPUs. We also prove the linearizability of our generalized heap model which enables us to reason about the expected results. We evaluate our concurrent heap thoroughly and show a maximum 19.49X speedup compared to the sequential CPU implementation and 2.11X speedup compared with the existing GPU implementation. We also apply our heap to single source shortest path with up to 1.23X speedup and 0/1 knapsack problem with up to 12.19X speedup.
Counting k-cliques in a graph is an important problem in graph analysis with many applications. Counting k-cliques is typically done by traversing search trees starting at each vertex in the graph. An important optimization is to eliminate search tree branches that discover the same clique redundantly. Eliminating redundant clique discovery is typically done via graph orientation or pivoting. Parallel implementations for both of these approaches have demonstrated promising performance on CPUs. In this paper, we present our GPU implementations of k-clique counting for both the graph orientation and pivoting approaches. Our implementations explore both vertex-centric and edge-centric parallelization schemes, and replace recursive search tree traversal with iterative traversal based on an explicitly-managed shared stack. We also apply various optimizations to reduce memory consumption and improve the utilization of parallel execution resources. Our evaluation shows that our best GPU implementation outperforms the best state-of-the-art parallel CPU implementation by a geometric mean speedup of 12.39x, 6.21x, and 18.99x for k = 4, 7, and 10, respectively. We also evaluate the impact of the choice of parallelization scheme and the incremental speedup of each optimization. Our code will be open-sourced to enable further research on parallelizing k-clique counting on GPUs.
In this new version of ZMCintegral, we have added the functionality of multi-function integrations, i.e. the ability to integrate more than $10^{3}$ different functions on GPUs. The Python API remains the similar as the previou
Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for low-order stencils on GPUs have been well-studied in the literature, not all of proposed enhancements work well for high-order stencils, such as those used for seismic modeling. Furthermore, coping with boundary conditions often requires different computational logic, which complicates efficient exploitation of the thread-level parallelism on GPUs. In this paper, we study high-order stencils and their unique characteristics on GPUs. We manually crafted a collection of implementations of a 25-point seismic modeling stencil in CUDA and related boundary conditions. We evaluate their code shapes, memory hierarchy usage, data-fetching patterns, and other performance attributes. We conducted an empirical evaluation of these stencils using several mature and emerging tools and discuss our quantitative findings. Among our implementations, we achieve twice the performance of a proprietary code developed in C and mapped to GPUs using OpenACC. Additionally, several of our implementations have excellent performance portability.
Load-balancing among the threads of a GPU for graph analytics workloads is difficult because of the irregular nature of graph applications and the high variability in vertex degrees, particularly in power-law graphs. We describe a novel load balancing scheme to address this problem. Our scheme is implemented in the IrGL compiler to allow users to generate efficient load balanced code for a GPU from high-level sequential programs. We evaluated several graph analytics applications on up to 16 distributed GPUs using IrGL to compile the code and the Gluon substrate for inter-GPU communication. Our experiments show that this scheme can achieve an average speed-up of 2.2x on inputs that suffer from severe load imbalance problems when previous state-of-the-art load-balancing schemes are used.
comments
Fetching comments Fetching comments
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا