
Best-by-Simulations: A Framework for Comparing Efficiency of Reconfigurable Multicore Architectures on Workloads with Deadlines

Added by EPTCS
Publication date: 2017
Language: English





Energy consumption is a major concern in multicore systems. Perhaps the simplest strategy for reducing energy costs is to use only as many cores as necessary while still being able to deliver a desired quality of service. Motivated by earlier work on a dynamic (heterogeneous) core allocation scheme for H.264 video decoding that reduces energy costs while delivering desired frame rates, we formulate operationally the general problem of executing a sequence of actions on a reconfigurable machine while meeting a corresponding sequence of absolute deadlines, with the objective of reducing cost. Using a transition system framework that associates costs (e.g., time, energy) with executing an action on a particular resource configuration, we use the notion of amortised cost to formulate, in terms of simulation relations, appropriate notions for comparing deadline-conformant executions. We believe these notions can provide the basis for an operational theory of optimal-cost executions and performance guarantees for approximate solutions, in particular relating the notion of simulation from transition systems to that of competitive analysis used for, e.g., online algorithms.
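As a rough illustration of the framework sketched in the abstract, the following Python fragment (not from the paper; the names `CostModel` and `run_schedule` are invented for this sketch) models a machine whose cost of executing an action depends on the current resource configuration, and checks whether a chosen configuration schedule meets a sequence of absolute deadlines, so that deadline-conformant executions can then be compared by their accumulated energy cost.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A configuration is just a label (e.g., which/how many cores are active).
Config = str
Action = str

@dataclass
class CostModel:
    # (configuration, action) -> (time cost, energy cost)
    costs: Dict[Tuple[Config, Action], Tuple[float, float]]

    def cost(self, cfg: Config, act: Action) -> Tuple[float, float]:
        return self.costs[(cfg, act)]

def run_schedule(model: CostModel,
                 schedule: List[Tuple[Config, Action]],
                 deadlines: List[float]) -> Tuple[bool, float]:
    """Execute actions on the chosen configurations; return whether every
    absolute deadline was met and the total energy spent."""
    elapsed, energy = 0.0, 0.0
    for (cfg, act), deadline in zip(schedule, deadlines):
        t, e = model.cost(cfg, act)
        elapsed += t
        energy += e
        if elapsed > deadline:          # missed an absolute deadline
            return False, energy
    return True, energy

# Toy example: a 'big' configuration is fast but costly, 'small' is slow but cheap.
model = CostModel({
    ("big", "decode"): (1.0, 4.0),
    ("small", "decode"): (3.0, 1.0),
})
frames = ["decode"] * 4
deadlines = [4.0, 8.0, 12.0, 16.0]        # one frame every 4 time units

# Two deadline-conformant executions can then be compared by total (amortised) cost.
ok_small, e_small = run_schedule(model, [("small", f) for f in frames], deadlines)
ok_big, e_big = run_schedule(model, [("big", f) for f in frames], deadlines)
print(ok_small, e_small, ok_big, e_big)   # both meet deadlines; 'small' uses less energy
```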



Related research

Sanjiva Prasad (2016)
Based on the two observations that diverse applications perform better on different multicore architectures, and that different phases of an application may have vastly different resource requirements, Pal et al. proposed a novel reconfigurable hardware approach for executing multithreaded programs. Instead of mapping a concurrent program to a fixed architecture, the architecture adaptively reconfigures itself to meet the application's concurrency and communication requirements, yielding significant improvements in performance. Based on our earlier abstract operational framework for multicore execution with hierarchical memory structures, we describe the execution of multithreaded programs on reconfigurable architectures that support a variety of clustered configurations. Such reconfiguration may not preserve the semantics of programs, due to the possible introduction of race conditions arising from concurrent accesses to shared memory by threads running on the different cores. We present an intuitive partial ordering on the cluster configurations, and show that the semantics of multithreaded programs is always preserved for reconfigurations upward in that ordering, whereas semantics preservation for arbitrary reconfigurations can be guaranteed for well-synchronised programs. We further show that a simple approximate notion of efficiency of execution on the different configurations can be obtained using amortised bisimulations, and extend it to dynamic reconfiguration.
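A minimal, purely illustrative Python sketch of one plausible such ordering, with configurations modelled as partitions of core ids into memory-sharing clusters and ordered by merging clusters; the paper's precise definition may differ.

```python
from typing import FrozenSet

# A cluster configuration: a partition of core ids into clusters that share memory.
Config = FrozenSet[FrozenSet[int]]

def leq(c1: Config, c2: Config) -> bool:
    """Illustrative partial order: c1 <= c2 if every cluster of c1 is contained
    in some cluster of c2, i.e. c2 only merges (never splits) c1's clusters."""
    return all(any(cl1 <= cl2 for cl2 in c2) for cl1 in c1)

def upward(old: Config, new: Config) -> bool:
    """A reconfiguration old -> new is 'upward' if it moves up in the order."""
    return leq(old, new)

cfg = lambda *clusters: frozenset(frozenset(c) for c in clusters)

fine   = cfg({0}, {1}, {2}, {3})          # four singleton clusters
paired = cfg({0, 1}, {2, 3})              # two 2-core clusters
fused  = cfg({0, 1, 2, 3})                # one fully shared cluster

print(upward(fine, paired))   # True: merging clusters
print(upward(paired, fused))  # True
print(upward(fused, paired))  # False: splitting a cluster goes downward
```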
OpenCL for FPGA enables developers to design FPGAs using a programming model similar to that used for processors. Recent works have shown that code optimization at the OpenCL level is important for achieving high computational efficiency. However, existing works either focus primarily on optimizing single kernels or rely solely on channels to design multi-kernel pipelines. In this paper, we propose a source-to-source compiler framework, MKPipe, for optimizing multi-kernel workloads in OpenCL for FPGA. Besides channels, we propose new schemes to enable multi-kernel pipelines. Our optimizing compiler employs a systematic approach to explore the trade-offs of these optimization methods. To enable more efficient overlapping between kernel executions, we also propose a novel workitem/workgroup-id remapping technique. Furthermore, we propose new algorithms for throughput balancing and resource balancing to tune the optimizations on individual kernels in the multi-kernel workloads. Our results show that our compiler-optimized multi-kernel workloads achieve up to 3.6x (1.4x on average) speedup over the baseline, in which the kernels have already been optimized individually.
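The following toy sketch (in Python, and not MKPipe's actual algorithm) illustrates the general idea behind throughput balancing in a multi-kernel pipeline: the pipeline runs at the rate of its slowest stage, so a greedy tuner keeps replicating the bottleneck kernel while an assumed FPGA resource budget allows.

```python
# Illustrative only: greedily raise the replication factor of the slowest pipeline
# stage while an FPGA resource budget allows it, since a multi-kernel pipeline runs
# at the throughput of its bottleneck stage.
from typing import List

def balance_throughput(base_throughput: List[float],
                       resource_per_copy: List[float],
                       budget: float) -> List[int]:
    factors = [1] * len(base_throughput)
    used = sum(resource_per_copy)
    while True:
        # Current stage throughputs given their replication factors.
        rates = [t * f for t, f in zip(base_throughput, factors)]
        bottleneck = rates.index(min(rates))
        if used + resource_per_copy[bottleneck] > budget:
            return factors                      # cannot afford another copy
        factors[bottleneck] += 1
        used += resource_per_copy[bottleneck]

# Three kernels: stage 1 is the bottleneck, so it gets replicated first.
print(balance_throughput([4.0, 1.0, 2.0], [10.0, 15.0, 12.0], budget=80.0))  # [1, 3, 2]
```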
In addition to the hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, an Intel Xeon Phi Knights Landing processor, and an NVIDIA Tesla K40c GPU. The conventional approach of storing the discretised derivatives in global arrays for solution advancement is found to be inefficient in terms of energy consumption and runtime. In contrast, a class of algorithms in which the discretised derivatives are evaluated on the fly or stored as thread-/process-local variables (yielding high compute intensity) is optimal both with respect to energy consumption and runtime. On all three hardware architectures considered, a speed-up of ~2x and an energy saving of ~2x are observed for the compute-intensive algorithms compared to the memory-intensive algorithm. The energy consumption is found to be proportional to runtime, irrespective of the power consumed, and the GPU achieves an energy saving of ~5x compared to the same algorithm on a CPU node.
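A schematic NumPy sketch (not the paper's code) of the two algorithm styles being compared, for a single explicit update with a central difference: one writes the discretised derivative to a separate global array before the update, the other folds the derivative evaluation into the update expression. It illustrates the algorithmic distinction only; the memory-traffic difference matters in compiled code, not in NumPy.

```python
import numpy as np

def advance_stored(u: np.ndarray, c: float, dx: float, dt: float) -> np.ndarray:
    dudx = np.empty_like(u)
    dudx[1:-1] = (u[2:] - u[:-2]) / (2 * dx)   # derivative written to a global array
    dudx[0] = dudx[-1] = 0.0
    return u - c * dt * dudx                    # read back for the update

def advance_on_the_fly(u: np.ndarray, c: float, dx: float, dt: float) -> np.ndarray:
    out = u.copy()
    # Derivative recomputed locally inside the update expression, never stored globally.
    out[1:-1] = u[1:-1] - c * dt * (u[2:] - u[:-2]) / (2 * dx)
    return out

u = np.sin(np.linspace(0, 2 * np.pi, 101))
a = advance_stored(u, c=1.0, dx=0.01, dt=0.001)
b = advance_on_the_fly(u, c=1.0, dx=0.01, dt=0.001)
print(np.allclose(a, b))   # True: same result, different memory traffic
```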
As multicore systems continue to gain ground in the high-performance computing world, linear algebra algorithms have to be reformulated, or new algorithms developed, in order to take advantage of the architectural features of these new processors. Fine-grained parallelism becomes a major requirement and introduces the need for loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization in which the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks which completely hides the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can only be exploited at the level of the BLAS operations.
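The following Python sketch illustrates the scheduling idea only (it is not the paper's implementation and performs no actual factorisation): tasks over tiles form a dependency DAG, and any task whose dependencies are complete may run, so independent updates execute out of order around an intrinsically sequential chain.

```python
from typing import Dict, Set

def dynamic_schedule(deps: Dict[str, Set[str]], cores: int) -> list:
    """Simulate dependency-driven scheduling: at each step run up to `cores`
    ready tasks (all dependencies already completed)."""
    done: Set[str] = set()
    schedule = []                         # list of "waves", one per scheduling step
    remaining = dict(deps)
    while remaining:
        ready = [t for t, d in remaining.items() if d <= done]
        wave = sorted(ready)[:cores]
        if not wave:
            raise ValueError("cyclic dependencies")
        schedule.append(wave)
        done.update(wave)
        for t in wave:
            del remaining[t]
    return schedule

# Tiny example in the spirit of tiled QR: a sequential panel chain (f0 -> f1 -> f2)
# with independent trailing updates that can be interleaved with it.
deps = {
    "f0": set(),
    "u01": {"f0"}, "u02": {"f0"},
    "f1": {"f0", "u01"},
    "u12": {"f1", "u02"},
    "f2": {"f1", "u12"},
}
print(dynamic_schedule(deps, cores=2))
# [['f0'], ['u01', 'u02'], ['f1'], ['u12'], ['f2']]
```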
Yossi Azar, Noam Touitou (2019)
In this paper, we present a framework used to construct and analyze algorithms for online optimization problems with deadlines or with delay over a metric space. Using this framework, we present algorithms for several different problems. We present an $O(D^{2})$-competitive deterministic algorithm for online multilevel aggregation with delay on a tree of depth $D$, an exponential improvement over the $O(D^{4}2^{D})$-competitive algorithm of Bienkowski et al. (ESA 16), where the only previously known improvement was for the special case of deadlines by Buchbinder et al. (SODA 17). We also present an $O(\log^{2} n)$-competitive randomized algorithm for online service with delay over any general metric space of $n$ points, improving upon the $O(\log^{4} n)$-competitive algorithm by Azar et al. (STOC 17). In addition, we present the problem of online facility location with deadlines. In this problem, requests arrive over time in a metric space and need to be served by their deadlines by facilities that are opened momentarily for some cost. We also consider the problem of facility location with delay, in which the deadlines are replaced with arbitrary delay functions. For those problems, we present $O(\log^{2} n)$-competitive algorithms, with $n$ the number of points in the metric space. The algorithmic framework we present includes techniques for the design of algorithms as well as techniques for their analysis.
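As a toy illustration of the problem setting only (not of the paper's algorithms), the sketch below implements a naive online strategy for facility location with deadlines on the real line: wait until some pending request's deadline expires, then open one facility and serve every request currently in the system. Competitive analysis would compare the cost of such an online strategy against the optimal offline cost in hindsight.

```python
from typing import List, Tuple

def naive_online(requests: List[Tuple[float, float, float]],
                 opening_cost: float) -> float:
    """requests: list of (arrival_time, position, deadline) on the real line."""
    served = [False] * len(requests)
    total = 0.0
    # Visit deadline instants in order; open a facility only when forced to.
    for t in sorted({d for (_, _, d) in requests}):
        expiring = [i for i, (a, p, d) in enumerate(requests)
                    if not served[i] and d == t]
        if not expiring:
            continue
        total += opening_cost
        centre = requests[expiring[0]][1]       # facility at an expiring request
        # Aggregate: serve every request already in the system, not only the expiring one.
        for i, (a, p, d) in enumerate(requests):
            if not served[i] and a <= t:
                total += abs(p - centre)
                served[i] = True
    return total

# Three nearby requests; the first deadline forces one opening that serves all of them.
reqs = [(0.0, 0.0, 3.0), (1.0, 0.4, 8.0), (2.0, 0.8, 9.0)]
print(naive_online(reqs, opening_cost=2.0))     # 2.0 + 0.0 + 0.4 + 0.8 = 3.2
```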
