ترغب بنشر مسار تعليمي؟ اضغط هنا

Parallel Binary Code Analysis

67   0   0.0 ( 0 )
 نشر من قبل Xiaozhu Meng
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Binary code analysis is widely used to assess a programs correctness, performance, and provenance. Binary analysis applications often construct control flow graphs, analyze data flow, and use debugging information to understand how machine code relates to source lines, inlined functions, and data types. To date, binary analysis has been single-threaded, which is too slow for applications such as performance analysis and software forensics, where it is becoming common to analyze binaries that are gigabytes in size and in large batches that contain thousands of binaries. This paper describes our design and implementation for accelerating the task of constructing control flow graphs (CFGs) from binaries with multithreading. Existing research focuses on addressing challenging code constructs encountered during constructing CFGs, including functions sharing code, jump table analysis, non-returning functions, and tail calls. However, existing analyses do not consider the complex interactions between concurrent analysis of shared code, making it difficult to extend existing serial algorithms to be parallel. A systematic methodology to guide the design of parallel algorithms is essential. We abstract the task of constructing CFGs as repeated applications of several core CFG operations regarding to creating functions, basic blocks, and edges. We then derive properties among CFG operations, including operation dependency, commutativity, monotonicity. These operation properties guide our design of a new parallel analysis for constructing CFGs. We achieved as much as 25$times$ speedup for constructing CFGs on 64 hardware threads. Binary analysis applications are significantly accelerated with the new parallel analysis: we achieve 8$times$ for a performance analysis tool and 7$times$ for a software forensic tool with 16 hardware threads.



قيم البحث

اقرأ أيضاً

Degradation analysis is used to analyze the useful lifetimes of systems, their failure rates, and various other system parameters like mean time to failure (MTTF), mean time between failures (MTBF), and the system failure rate (SFR). In many systems, certain possible parallel paths of execution that have greater chances of success are preferred over others. Thus we introduce here the concept of probabilistic parallel choice. We use binary and $n$-ary probabilistic choice operators in describing the selections of parallel paths. These binary and $n$-ary probabilistic choice operators are considered so as to represent the complete system (described as a series-parallel system) in terms of the probabilities of selection of parallel paths and their relevant parameters. Our approach allows us to derive new and generalized formulae for system parameters like MTTF, MTBF, and SFR. We use a generalized exponential distribution, allowing distinct installation times for individual components, and use this model to derive expressions for such system parameters.
Using a realistic molecular catalyst system, we conduct scaling studies of ab initio molecular dynamics simulations using the CP2K code on both Intel Xeon CPU and NVIDIA V100 GPU architectures. We explore using process placement and affinity to gain additional performance improvements. We also use statistical methods to understand performance changes in spite of the variability in runtime for each molecular dynamics timestep. We found ideal conditions for CPU runs included at least four MPI ranks per node, bound evenly across each socket, and fully utilizing processing cores with one OpenMP thread per core, no benefit was shown from reserving cores for the system. The CPU-only simulations scaled at 70% or more of the ideal scaling up to 10 compute nodes, after which the returns began to diminish more quickly. Simulations on a single 40-core node with two NVIDIA V100 GPUs for acceleration achieved over 3.7x speedup compared to the fastest single 36-core node CPU-only version, and showed 13% speedup over the fastest time we achieved across five CPU-only nodes.
64 - Neil J. Gunther 2020
We prove that parallel processing with homogeneous processors is logically equivalent to fast serial processing. The reverse proposition can also be used to identify obscure opportunities for applying parallelism. To our knowledge, this theorem has n ot been previously reported in the queueing theory literature. A plausible explanation is offered for why this might be. The basic homogeneous theorem is also extended to optimizing the latency of heterogenous parallel arrays.
67 - Dominik Ernst 2021
Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning par ameters, and parallelization strategies. To cover the huge search space, code generation frameworks may apply time-intensive autotuning, exploit scenario-specific performance models, or treat performance as an intangible black box that must be described via machine learning. This paper addresses the selection problem by identifying the relevant performance-defining mechanisms through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient candidates with high accuracy. Our current approach targets memory-intensive GPGPU applications and focuses on the correct modeling of data transfer volumes to all levels of the memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which is used to generate kernels for a range four 3D25pt stencil and a complex two phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best performing candidate. The method is not limited to stencil kernels, but can be integrated into any code generator that can generate the required address expressions.
We present a new open source code for massive parallel computation of Voronoi tessellations(VT hereafter) in large data sets. The code is focused for astrophysical purposes where VT densities and neighbors are widely used. There are several serial Vo ronoi tessellation codes, however no open source and parallel implementations are available to handle the large number of particles/galaxies in current N-body simulations and sky surveys. Parallelization is implemented under MPI and VT using Qhull library. Domain decomposition takes into account consistent boundary computation between tasks, and includes periodic conditions. In addition, the code computes neighbors list, Voronoi density, Voronoi cell volume, density gradient for each particle, and densities on a regular grid.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا