Research papers, master's and doctoral theses about Performance

Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero

88 - Markus Wittmann , Thomas Zeiser , Georg Hager 2015

Simple floating point operations like addition or multiplication on normalized floating point values can be computed by current AMD and Intel processors in three to five cycles. This is different for denormalized numbers, which appear when an underflow occurs and the value can no longer be represented as a normalized floating-point value. Here the costs are about two orders of magnitude higher.

Performance
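As a small illustration of the denormalized values described in the floating-point note above, the following Python sketch shows how an underflow produces one. The helper name `is_subnormal` is hypothetical and not taken from the paper, and the sketch only detects such values; it does not measure the cycle costs the authors report.

```python
import sys


def is_subnormal(x: float) -> bool:
    """Return True if x is a nonzero double smaller in magnitude than the smallest normal value."""
    return x != 0.0 and abs(x) < sys.float_info.min


smallest_normal = sys.float_info.min   # smallest positive normalized double
underflowed = smallest_normal / 4.0    # gradual underflow yields a denormalized value

print(is_subnormal(smallest_normal))   # False
print(is_subnormal(underflowed))       # True
```

Dividing the smallest normalized double by four leaves a nonzero result that can only be stored in denormalized form, which is the situation the note is concerned with.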

Supercharge me: Boost Router Convergence with SDN

110 - Michael Alan Chang , Thomas Holterbach , Markus Happe 2015

Software Defined Networking (SDN) is a promising approach for improving the performance and manageability of future network architectures. However, little work has gone into using SDN to improve the performance and manageability of existing networks without requiring a major overhaul of the existing network infrastructure. In this paper, we show how we can dramatically improve, or supercharge, the performance of existing IP routers by combining them with SDN-enabled equipment in a novel way. More particularly, our supercharged solution substantially reduces the convergence time of an IP router upon link or node failure without inducing any reconfiguration of the IP router itself. Our key insight is to use the SDN controller to precompute backup forwarding entries and immediately activate them upon failure, enabling almost immediate data-plane recovery, while letting the router converge at its typical slow pace. By boosting existing equipment's performance, we not only increase their lifetime but also provide new incentives for network operators to kickstart SDN deployment. We implemented a fully functional supercharger and use it to boost the convergence performance of a Cisco Nexus 7k router. Using an FPGA-based traffic generator, we show that our supercharged router systematically converges within ~150ms, a 900x reduction with respect to its normal convergence time under similar conditions.

Networking and Internet Architecture Performance
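The key idea in the SDN abstract above, precomputing backup forwarding entries and activating them on failure, can be sketched roughly as follows. This is a toy model under stated assumptions, not the authors' implementation: the class names `Entry` and `ForwardingTable`, the method names, and the prefixes and next-hop labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    primary: str       # next hop used in normal operation
    backup: str        # next hop precomputed before any failure
    active: str = ""   # next hop currently in use

    def __post_init__(self):
        self.active = self.active or self.primary


class ForwardingTable:
    """Hypothetical table holding a precomputed backup for every prefix."""

    def __init__(self):
        self.entries = {}

    def install(self, prefix, primary, backup):
        self.entries[prefix] = Entry(primary, backup)

    def on_failure(self, failed_next_hop):
        """Switch every prefix that uses the failed next hop to its backup."""
        switched = []
        for prefix, entry in self.entries.items():
            if entry.active == failed_next_hop:
                entry.active = entry.backup
                switched.append(prefix)
        return switched

    def next_hop(self, prefix):
        return self.entries[prefix].active


table = ForwardingTable()
table.install("10.0.0.0/8", primary="R2", backup="R3")
table.install("192.168.0.0/16", primary="R4", backup="R3")
print(table.on_failure("R2"))        # ['10.0.0.0/8']
print(table.next_hop("10.0.0.0/8"))  # R3
```

Because the backup is chosen ahead of time, reacting to a failure is only a lookup and a swap, which is the intuition behind recovering the data plane before the router itself has converged.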

Separable projection integrals for higher-order correlators of the cosmic microwave sky: Acceleration by factors exceeding 100

77 - J. P. Briggs , S. J. Pennycook , J. R. Fergusson 2015

We present a case study describing efforts to optimise and modernise Modal, the simulation and analysis pipeline used by the Planck satellite experiment for constraining general non-Gaussian models of the early universe via the bispectrum (or three-point correlator) of the cosmic microwave background radiation. We focus on one particular element of the code: the projection of bispectra from the end of inflation to the spherical shell at decoupling, which defines the CMB we observe today. This code involves a three-dimensional inner product between two functions, one of which requires an integral, on a non-rectangular domain containing a sparse grid. We show that by employing separable methods this calculation can be reduced to a one-dimensional summation plus two integrations, reducing the overall dimensionality from four to three. The introduction of separable functions also solves the issue of the non-rectangular sparse grid. This separable method can become unstable in certain cases and so the slower non-separable integral must be calculated instead. We present a discussion of the optimisation of both approaches. We show significant speed-ups of ~100x, arising from a combination of algorithmic improvements and architecture-aware optimisations targeted at improving thread and vectorisation behaviour. The resulting MPI/OpenMP hybrid code is capable of executing on clusters containing processors and/or coprocessors, with strong-scaling efficiency of 98.6% on up to 16 nodes. We find that a single coprocessor outperforms two processor sockets by a factor of 1.3x and that running the same code across a combination of both microarchitectures improves performance-per-node by a factor of 3.38x. By making bispectrum calculations competitive with those for the power spectrum (or two-point correlator) we are now able to consider joint analysis for cosmological science exploitation of new data.

Distributed Parallel and Cluster Computing Cosmology and Nongalactic Astrophysics Performance

GraphMat: High performance graph analytics made productive

464 - Narayanan Sundaram , Nadathur Rajagopalan Satish , Md Mostofa Alin Patwary 2015

Given the growing importance of large-scale graph analytics, there is a need to improve the performance of graph analysis frameworks without compromising on productivity. GraphMat is our solution to bridge this gap between a user-friendly graph analytics framework and native, hand-optimized code. GraphMat functions by taking vertex programs and mapping them to high performance sparse matrix operations in the backend. We get the productivity benefits of a vertex programming framework without sacrificing performance. GraphMat is in C++, and we have been able to write a diverse set of graph algorithms in this framework with the same effort compared to other vertex programming frameworks. GraphMat performs 1.2-7X faster than high performance frameworks such as GraphLab, CombBLAS and Galois. It achieves better multicore scalability (13-15X on 24 cores) than other frameworks and is 1.2X off native, hand-optimized code on a variety of different graph algorithms. Since GraphMat performance depends mainly on a few scalable and well-understood sparse matrix operations, GraphMat can naturally benefit from the trend of increasing parallelism on future hardware.

Performance Databases Distributed Parallel and Cluster Computing
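To make the idea in the GraphMat abstract above more concrete, here is a rough sketch of how a vertex program can be phrased as a generalized sparse matrix-vector product. It is written in Python for brevity and does not use GraphMat's actual C++ API; the function names, the dictionary-based sparse format, and the choice of single-source shortest paths as the example are all assumptions made for illustration.

```python
def generalized_spmv(edges, messages, process, reduce):
    """Multiply a sparse adjacency structure by a sparse message vector.

    edges maps a source vertex to a list of (destination, weight) pairs.
    messages maps each active vertex to the value it sends.
    process combines a message with an edge weight; reduce merges values per destination.
    """
    result = {}
    for src, msg in messages.items():
        for dst, weight in edges.get(src, []):
            value = process(msg, weight)
            result[dst] = reduce(result[dst], value) if dst in result else value
    return result


def sssp(edges, source):
    """Single-source shortest paths for non-negative weights, one product per iteration."""
    dist = {source: 0.0}
    active = {source: 0.0}
    while active:
        reduced = generalized_spmv(edges, active, lambda m, w: m + w, min)
        active = {}
        for vertex, d in reduced.items():
            if d < dist.get(vertex, float("inf")):
                dist[vertex] = d       # apply step: keep the improved distance
                active[vertex] = d     # vertex sends a message next round
    return dist


graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)], "c": []}
print(sssp(graph, "a"))  # {'a': 0.0, 'b': 1.0, 'c': 3.0}
```

The per-vertex logic lives entirely in the small `process` and `reduce` callbacks, while the loop over edges is a single matrix-style kernel. That separation is what would let an optimized backend speed up every algorithm written this way.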

Faster 64-bit universal hashing using carry-less multiplications

219 - Daniel Lemire , Owen Kaser 2015

Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set in their x64 processors. We use CLMUL to implement an almost universal 64-bit hash family (CLHASH). We compare this new family with what might be the fastest almost universal family on x64 processors (VHASH). We find that CLHASH is at least 60% faster. We also compare CLHASH with a popular hash function designed for speed (Google's CityHash). We find that CLHASH is 40% faster than CityHash on inputs larger than 64 bytes and just as fast otherwise.

Data Structures and Algorithms Performance
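For readers unfamiliar with the carry-less multiplication mentioned in the hashing abstract above, the following sketch shows the operation in software. It illustrates only the arithmetic primitive, not CLHASH itself, and the function name `clmul` is a hypothetical label rather than an API from the paper.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two non-negative integers.

    Works like schoolbook binary multiplication, except that partial
    products are combined with XOR instead of addition, so no carries occur.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a   # XOR in the shifted copy of a
        a <<= 1
        b >>= 1
    return result


print(clmul(3, 3))  # 5
print(clmul(7, 3))  # 9
```

Viewing the bits as polynomial coefficients modulo 2, `clmul(3, 3)` is (x + 1)(x + 1) = x^2 + 1, which is 5, whereas ordinary multiplication gives 9. The hardware instruction computes this kind of product on 64-bit operands in one step.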

Stochastic HYPE: Flow-based modelling of stochastic hybrid systems

101 - Luca Bortolussi , Vashti Galpin , Jane Hillston 2014

Stochastic HYPE is a novel process algebra that models stochastic, instantaneous and continuous behaviour. It develops the flow-based approach of the hybrid process algebra HYPE by replacing non-urgent events with events with exponentially-distributed durations and also introduces random resets. The random resets allow for general stochasticity, and in particular allow for the use of event durations drawn from distributions other than the exponential distribution. To account for stochasticity, the semantics of stochastic HYPE target piecewise deterministic Markov processes (PDMPs), via intermediate transition-driven stochastic hybrid automata (TDSHA) in contrast to the hybrid automata used as semantic target for HYPE. Stochastic HYPE models have a specific structure where the controller of a system is separate from the continuous aspect of this system providing separation of concerns and supporting reasoning. A novel equivalence is defined which captures when two models have the same stochastic behaviour (as in stochastic bisimulation), instantaneous behaviour (as in classical bisimulation) and continuous behaviour. These techniques are illustrated via an assembly line example.

Systems and Control Logic in Computer Science Performance

Characterizing and Subsetting Big Data Workloads

386 - Zhen Jia , Jianfeng Zhan , Lei Wang 2014

Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates these challenges. In this paper, we first use Principal Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principal components obtained from the PCA to investigate the similarity among big data workloads, and we verify the importance of including different software stacks for big data benchmarking. Third, we select seven representative big data workloads by removing redundant ones and release the BigDataBench simulation version, which is publicly available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.

Performance

COFFEE: an Optimizing Compiler for Finite Element Local Assembly

156 - Fabio Luporini , Ana Lucia Varbanescu , Florian Rathgeber 2014

The numerical solution of partial differential equations using the finite element method is one of the key applications of high performance computing. Local assembly is its characteristic operation. This entails the execution of a problem-specific kernel to numerically evaluate an integral for each element in the discretized problem domain. Since the domain size can be huge, executing efficient kernels is fundamental. Their optimization is, however, a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions make it hard to determine a single or unique sequence of successful transformations. Therefore, we present the design and systematic evaluation of COFFEE, a domain-specific compiler for local assembly kernels. COFFEE manipulates abstract syntax trees generated from a high-level domain-specific language for PDEs by introducing domain-aware composable optimizations aimed at improving instruction-level parallelism, especially SIMD vectorization, and register locality. It then generates C code including vector intrinsics. Experiments using a range of finite-element forms of increasing complexity show that significant performance improvement is achieved.

Mathematical Software Computational Engineering Performance

Separation of timescales in a two-layered network

61 - Maria Vlasiou , Jiheng Zhang , Bert Zwart 2014

We investigate a computer network consisting of two layers occurring in, for example, application servers. The first layer incorporates the arrival of jobs at a network of multi-server nodes, which we model as a many-server Jackson network. At the second layer, active servers at these nodes act now as customers who are served by a common CPU. Our main result shows a separation of time scales in heavy traffic: the main source of randomness occurs at the (aggregate) CPU layer; the interactions between different types of nodes at the other layer are shown to converge to a fixed point at a faster time scale; this also yields a state-space collapse property. Apart from these fundamental insights, we also obtain an explicit approximation for the joint law of the number of jobs in the system, which is provably accurate for heavily loaded systems and performs numerically well for moderately loaded systems. The obtained results for the model under consideration can be applied to thread-pool dimensioning in application servers, while the technique seems applicable to other layered systems too.

Probability Performance

Degradation Analysis of Probabilistic Parallel Choice Systems

193 - Avinash Saxena , Shrisha Rao 2014

Degradation analysis is used to analyze the useful lifetimes of systems, their failure rates, and various other system parameters like mean time to failure (MTTF), mean time between failures (MTBF), and the system failure rate (SFR). In many systems, certain possible parallel paths of execution that have greater chances of success are preferred over others. Thus we introduce here the concept of probabilistic parallel choice. We use binary and $n$-ary probabilistic choice operators in describing the selections of parallel paths. These binary and $n$-ary probabilistic choice operators are considered so as to represent the complete system (described as a series-parallel system) in terms of the probabilities of selection of parallel paths and their relevant parameters. Our approach allows us to derive new and generalized formulae for system parameters like MTTF, MTBF, and SFR. We use a generalized exponential distribution, allowing distinct installation times for individual components, and use this model to derive expressions for such system parameters.

Performance Statistics Theory
