
Power Regulation in High Performance Multicore Processors

 Added by Yorai Wardi
 Publication date 2017
Language: English





This paper presents, implements, and evaluates a power-regulation technique for multicore processors, based on an integral controller with adjustable gain. The gain is designed for wide stability margins and is computed in real time as part of the control law. The tracking performance of the control system is robust with respect to modeling uncertainties and computational errors in the loop. The main challenge in designing such a controller is that the power dissipation of program workloads varies widely and often cannot be measured accurately; hence extant controllers are either ad hoc or based on a priori modeling characterizations of the processor and workloads. Our approach is different. Leveraging the aforementioned robustness, it uses a simple textbook modeling framework and adjusts its parameters in real time by a system-identification module. In doing so it trades modeling precision for fast computations in the loop, making it suitable for on-line implementation in commodity data-center processors. Consequently, the proposed controller is agnostic in the sense that it does not require any a priori system characterizations. We present an implementation of the controller on Intel's fourth-generation microarchitecture, Haswell, and test it on a number of industry benchmark programs that are used in scientific computing and data-center applications. Results of these experiments are presented in detail, exposing the practical challenges of implementing provably convergent power-regulation solutions in commodity multicore processors.
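The idea of an integral controller whose gain is recomputed at every step can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the cubic toy power model `toy_power`, the function name `regulate_power`, and the use of a finite-difference slope estimate as a stand-in for the system-identification module (the abstract does not say how the gain is actually estimated). The control law shown is the standard discrete integrator, u_n = u_{n-1} + A_n e_{n-1}, with the gain A_n taken as the inverse of the estimated slope of power with respect to the control input.

```python
def toy_power(freq_ghz):
    """Hypothetical plant: static power plus a cubic dynamic term (watts)."""
    return 10.0 + 2.0 * freq_ghz ** 3


def regulate_power(plant, setpoint, u0, steps, delta=1e-3, min_slope=1e-6):
    """Integral control with a gain recomputed at every step.

    The slope estimate is an assumed stand-in for a system-identification
    module: a forward finite difference on the plant response.
    """
    u = u0
    y = plant(u)
    history = [(u, y)]
    for _ in range(steps):
        # Estimate how strongly the output responds to the input right now.
        slope = (plant(u + delta) - y) / delta
        # Guard against a vanishing slope, which would blow up the gain.
        if abs(slope) < min_slope:
            slope = min_slope if slope >= 0 else -min_slope
        gain = 1.0 / slope
        # Integrator update: accumulate the gain-weighted tracking error.
        error = setpoint - y
        u = u + gain * error
        y = plant(u)
        history.append((u, y))
    return history


if __name__ == "__main__":
    for step, (u, y) in enumerate(regulate_power(toy_power, 40.0, 2.0, 8)):
        print(f"step {step}: input={u:.4f} power={y:.4f}")
```

Because the gain only needs to be roughly the inverse slope for the loop to converge, a crude estimate suffices, which is consistent with the abstract's point about trading modeling precision for fast computation in the loop.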




Read More

The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in this type of network. One of these approaches leverages the IM2COL transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the GEMM kernel in many linear algebra libraries. The main problems of this approach are 1) the large memory workspace required to host the intermediate matrices generated by the IM2COL transform; and 2) the time to perform the IM2COL transform, which is not negligible for complex neural networks. This paper presents a portable high performance convolution algorithm based on the BLIS realization of the GEMM kernel that avoids the use of the intermediate memory by taking advantage of the BLIS structure. In addition, the proposed algorithm eliminates the cost of the explicit IM2COL transform, while maintaining the portability and performance of the underlying realization of GEMM in BLIS.
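The baseline approach described above, IM2COL followed by GEMM, can be sketched in a few lines of plain Python. This is only an illustration of the explicit transform that the abstract says the proposed algorithm avoids; it is not the BLIS-based method, and the function names `im2col`, `matmul`, and `conv2d_im2col` are hypothetical. The sketch assumes a single-channel 2D input, stride 1, and no padding.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of a 2D image into one column.

    Returns a (kh*kw) x (out_h*out_w) matrix plus the output dimensions.
    Assumes stride 1 and no padding.
    """
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = []
    for i in range(kh):
        for j in range(kw):
            # Row (i, j) holds the patch element at offset (i, j) for every patch.
            cols.append([image[r + i][c + j]
                         for r in range(out_h) for c in range(out_w)])
    return cols, out_h, out_w


def matmul(a, b):
    """Naive GEMM stand-in: C = A * B for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def conv2d_im2col(image, kernels):
    """Convolve (cross-correlate) one image with several equally sized kernels."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    cols, out_h, out_w = im2col(image, kh, kw)
    # Flatten each kernel into one row so a single GEMM applies all kernels.
    weights = [[k[i][j] for i in range(kh) for j in range(kw)] for k in kernels]
    flat = matmul(weights, cols)
    # Reshape each flat result row back into an out_h x out_w feature map.
    return [[row[r * out_w:(r + 1) * out_w] for r in range(out_h)]
            for row in flat]
```

Note that the intermediate matrix holds kh*kw entries per output pixel, so each input value is copied several times; this duplication is the memory workspace cost the abstract refers to.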
In this paper, we use multithreaded fast Fourier transforms provided in three highly optimized packages, FFTW-2.1.5, FFTW-3.3.7, and Intel MKL FFT, to present a novel model-based parallel computing technique as an effective and portable method for optimizing the performance of scientific multithreaded routines, especially in the current multicore era, where processors have an abundant number of cores. We propose two optimization methods, PFFT-FPM and PFFT-FPM-PAD, based on this technique. They compute the 2D-DFT of a complex signal matrix of size NxN using p abstract processors. Both algorithms take as inputs discrete 3D functions of performance against problem size of the processors, and output the transformed signal matrix. Based on our experiments on a modern Intel Haswell multicore server consisting of 36 physical cores, the average and maximum speedups observed for PFFT-FPM using FFTW-3.3.7 are 1.9x and 6.8x respectively, and the average and maximum speedups observed using Intel MKL FFT are 1.3x and 2x respectively. The average and maximum speedups observed for PFFT-FPM-PAD using FFTW-3.3.7 are 2x and 9.4x respectively, and the average and maximum speedups observed using Intel MKL FFT are 1.4x and 5.9x respectively.
Feng Xia, Liping Liu, Longhua Ma, 2008
The goal of this work is to minimize the energy dissipation of embedded controllers without jeopardizing the quality of control (QoC). Taking advantage of the dynamic voltage scaling (DVS) technology, this paper develops a performance-aware power management scheme for embedded controllers with processors that allow multiple voltage levels. The periods of control tasks are adapted online with respect to the current QoC, thus facilitating additional energy reduction over standard DVS. To avoid the waste of CPU resources as a result of the discrete voltage levels, a resource reclaiming mechanism is employed to maximize the CPU utilization and also to improve the QoC. Simulations are conducted to evaluate the performance of the proposed scheme. Compared with the optimal standard DVS scheme, the proposed scheme is shown to be able to save remarkably more energy while maintaining comparable QoC.
Junyao Guo, Gabriela Hug, 2016
Distributed optimization for solving non-convex Optimal Power Flow (OPF) problems in power systems has attracted tremendous attention in the last decade. Most studies are based on the geographical decomposition of IEEE test systems for verifying the feasibility of the proposed approaches. However, it is not clear whether one can extrapolate from these studies that those approaches can be applied to very large-scale real-world systems. In this paper, we show, for the first time, that distributed optimization can be effectively applied to a large-scale real transmission network, namely the Polish 2383-bus system, for which no pre-defined partitions exist, by using a recently developed partitioning technique. More specifically, the problem solved is the AC OPF problem with geographical decomposition of the network, using the Alternating Direction Method of Multipliers (ADMM) in conjunction with the partitioning technique. Through extensive experimental results and analytical studies, we show that with the presented partitioning technique the convergence performance of ADMM can be improved substantially, which enables the application of distributed approaches to very large-scale systems.
Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or task-based programming models can lead to thread desynchronization. Hence, the simplifying assumption that all cores execute the same loop cannot be upheld. Motivated by observations on plain and modifi