ترغب بنشر مسار تعليمي؟ اضغط هنا

Energy-efficiency evaluation of Intel KNL for HPC workloads

104   0   0.0 ( 0 )
 نشر من قبل Alessandro Gabbana
 تاريخ النشر 2018
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Energy consumption is increasingly becoming a limiting factor to the design of faster large-scale parallel systems, and development of energy-efficient and energy-aware applications is today a relevant issue for HPC code-developer communities. In this work we focus on energy performance of the Knights Landing (KNL) Xeon Phi, the latest many-core architecture processor introduced by Intel into the HPC market. We take into account the 64-core Xeon Phi 7230, and analyze its energy performance using both the on-chip MCDRAM and the regular DDR4 system memory as main storage for the application data-domain. As a benchmark application we use a Lattice Boltzmann code heavily optimized for this architecture and implemented using different memory data layouts to store its lattice. We assessthen the energy consumption using different memory data-layouts, kind of memory (DDR4 or MCDRAM) and number of threads per core.



قيم البحث

اقرأ أيضاً

Many HPC applications suffer from a bottleneck in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be underutilized. It is hard for developers and runtime systems to ensure that all crit ical resources are fully exploited by a single application, so an attractive technique for increasing HPC system utilization is to colocate multiple applications on the same server. When applications share critical resources, however, contention on shared resources may lead to reduced application performance. In this paper, we show that server efficiency can be improved by first modeling the expected performance degradation of colocated applications based on measured hardware performance counters, and then exploiting the model to determine an optimized mix of colocated applications. This paper presents a new intelligent resource manager and makes the following contributions: (1) a new machine learning model to predict the performance degradation of colocated applications based on hardware counters and (2) an intelligent scheduling scheme deployed on an existing resource manager to enable application co-scheduling with minimum performance degradation. Our results show that our approach achieves performance improvements of 7% (avg) and 12% (max) compared to the standard policy commonly used by existing job managers.
Byte-addressable non-volatile memory (NVM) features high density, DRAM comparable performance, and persistence. These characteristics position NVM as a promising new tier in the memory hierarchy. Nevertheless, NVM has asymmetric read and write perfor mance, and considerably higher write energy than DRAM. Our work provides an in-depth evaluation of the first commercially available byte-addressable NVM -- the Intel Optane DC persistent memory. The first part of our study quantifies the latency, bandwidth, power efficiency, and energy consumption under eight memory configurations. We also evaluate the real impact on in-memory graph processing workloads. Our results show that augmenting NVM with DRAM is essential, and the combination can effectively bridge the performance gap and provide reasonable performance with higher capacity. We also identify NUMA-related performance characteristics for accesses to memory on a remote socket. In the second part, we employ two fine-grained allocation policies to control traffic distribution between DRAM and NVM. Our results show that bandwidth spilling between DRAM and NVM could provide 2.0x bandwidth and enable $20%$ larger problems than using DRAM as a cache. Also, write isolation between DRAM and NVM could save up to 3.9x energy and improves bandwidth by 3.1x compared to DRAM-cached NVM. We establish a roofline model to explore power and energy efficiency at various distributions of read-only traffic. Our results show that NVM requires 1.8x lower power than DRAM for data-intensive workloads. Overall, applications can significantly optimize performance and power efficiency by adapting traffic distribution to NVM and DRAM through memory configurations and fine-grained policies to fully exploit the new memory device.
We review our work done to optimize the staggered conjugate gradient (CG) algorithm in the MILC code for use with the Intel Knights Landing (KNL) architecture. KNL is the second gener- ation Intel Xeon Phi processor. It is capable of massive thread p arallelism, data parallelism, and high on-board memory bandwidth and is being adopted in supercomputing centers for scientific research. The CG solver consumes the majority of time in production running, so we have spent most of our effort on it. We compare performance of an MPI+OpenMP baseline version of the MILC code with a version incorporating the QPhiX staggered CG solver, for both one-node and multi-node runs.
We discuss practical methods to ensure near wirespeed performance from clusters with either one or two Intel(R) Omni-Path host fabric interfaces (HFI) per node, and Intel(R) Xeon Phi(TM) 72xx (Knights Landing) processors, and using the Linux operatin g system. The study evaluates the performance improvements achievable and the required programming approaches in two distinct example problems: firstly in Cartesian communicator halo exchange problems, appropriate for structured grid PDE solvers that arise in quantum chromodynamics simulations of particle physics, and secondly in gradient reduction appropriate to synchronous stochastic gradient descent for machine learning. As an example, we accelerate a published Baidu Research reduction code and obtain a factor of ten speedup over the original code using the techniques discussed in this paper. This displays how a factor of ten speedup in strongly scaled distributed machine learning could be achieved when synchronous stochastic gradient descent is massively parallelised with a fixed mini-batch size. We find a significant improvement in performance robustness when memory is obtained using carefully allocated 2MB huge virtual memory pages, implying that either non-standard allocation routines should be used for communication buffers. These can be accessed via a LD_PRELOAD override in the manner suggested by libhugetlbfs. We make use of a the Intel(R) MPI 2019 library Technology Preview and underlying software to enable thread concurrency throughout the communication software stake via multiple PSM2 endpoints per process and use of multiple independent MPI communicators. When using a single MPI process per node, we find that this greatly accelerates delivered bandwidth in many core Intel(R) Xeon Phi processors.
HPC centers face increasing demand for software flexibility, and there is growing consensus that Linux containers are a promising solution. However, existing container build solutions require root privileges and cannot be used directly on HPC resourc es. This limitation is compounded as supercomputer diversity expands and HPC architectures become more dissimilar from commodity computing resources. Our analysis suggests this problem can best be solved with low-privilege containers. We detail relevant Linux kernel features, propose a new taxonomy of container privilege, and compare two open-source implementations: mostly-unprivileged rootless Podman and fully-unprivileged Charliecloud. We demonstrate that low-privilege container build on HPC resources works now and will continue to improve, giving normal users a better workflow to securely and correctly build containers. Minimizing privilege in this way can improve HPC user and developer productivity as well as reduce support workload for exascale applications.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا