أوراق بحثية, رسائل ماجستير ودكتوراه منشورة من قبل Gene Cooperman

Checkpointing as a Service in Heterogeneous Cloud Environments

124 - Jiajun Cao , Matthieu Simonin , Gene Cooperman 2014

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism w ithin the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.

النظم الموزعة والتوازية والحوسبة العنقودية

Transparent Checkpoint-Restart for Hardware-Accelerated 3D Graphics

327 - Samaneh Kazemi Nafchi , Rohan Garg , Gene Cooperman 2013

Providing fault-tolerance for long-running GPU-intensive jobs requires application-specific solutions, and often involves saving the state of complex data structures spread among many graphics libraries. This work describes a mechanism for transparen t GPU-independent checkpoint-restart of 3D graphics. The approach is based on a record-prune-replay paradigm: all OpenGL calls relevant to the graphics driver state are recorded; calls not relevant to the internal driver state as of the last graphics frame prior to checkpoint are discarded; and the remaining calls are replayed on restart. A previous approach for OpenGL 1.5, based on a shadow device driver, required more than 78,000 lines of OpenGL-specific code. In contrast, the new approach, based on record-prune-replay, is used to implement the same case in just 4,500 lines of code. The speed of this approach varies between 80 per cent and nearly 100 per cent of the speed of the native hardware acceleration for OpenGL 1.5, as measured when running the ioquake3 game under Linux. This approach has also been extended to demonstrate checkpointing of OpenGL 3.0 for the first time, with a demonstration for PyMol, for molecular visualization.

أنظمة التشغيل

Use of checkpoint-restart for complex HEP software on traditional architectures and Intel MIC

60 - Kapil Arya , Gene Cooperman , Andrea Dotti 2013

Process checkpoint-restart is a technology with great potential for use in HEP workflows. Use cases include debugging, reducing the startup time of applications both in offline batch jobs and the High Level Trigger, permitting job preemption in envir onments where spare CPU cycles are being used opportunistically and efficient scheduling of a mix of multicore and single-threaded jobs. We report on tests of checkpoint-restart technology using CMS software, Geant4-MT (multi-threaded Geant4), and the DMTCP (Distributed Multithreaded Checkpointing) package. We analyze both single- and multi-threaded applications and test on both standard Intel x86 architectures and on Intel MIC. The tests with multi-threaded applications on Intel MIC are used to consider scalability and performance. These are considered an indicator of what the future may hold for many-core computing.

الفيزياء الحسابية النظم الموزعة والتوازية والحوسبة العنقودية التحليل العددي

A Generic Checkpoint-Restart Mechanism for Virtual Machines

72 - Rohan Garg , Komal Sodha , Gene Cooperman 2012

It is common today to deploy complex software inside a virtual machine (VM). Snapshots provide rapid deployment, migration between hosts, dependability (fault tolerance), and security (insulating a guest VM from the host). Yet, for each virtual machi ne, the code for snapshots is laboriously developed on a per-VM basis. This work demonstrates a generic checkpoint-restart mechanism for virtual machines. The mechanism is based on a plugin on top of an unmodified user-space checkpoint-restart package, DMTCP. Checkpoint-restart is demonstrated for three virtual machines: Lguest, user-space QEMU, and KVM/QEMU. The plugins for Lguest and KVM/QEMU require just 200 lines of code. The Lguest kernel driver API is augmented by 40 lines of code. DMTCP checkpoints user-space QEMU without any new code. KVM/QEMU, user-space QEMU, and DMTCP need no modification. The design benefits from other DMTCP features and plugins. Experiments demonstrate checkpoint and restart in 0.2 seconds using forked checkpointing, mmap-based fast-restart, and incremental Btrfs-based snapshots.

أنظمة التشغيل

A Bit-Compatible Shared Memory Parallelization for ILU(k) Preconditioning and a Bit-Compatible Generalization to Distributed Memory

200 - Xin Dong , Gene Cooperman 2011

ILU(k) is a commonly used preconditioner for iterative linear solvers for sparse, non-symmetric systems. It is often preferred for the sake of its stability. We present TPILU(k), the first efficiently parallelized ILU(k) preconditioner that maintains this important stability property. Even better, TPILU(k) preconditioning produces an answer that is bit-compatible with the sequential ILU(k) preconditioning. In terms of performance, the TPILU(k) preconditioning is shown to run faster whenever more cores are made available to it --- while continuing to be as stable as sequential ILU(k). This is in contrast to some competing methods that may become unstable if the degree of thread parallelism is raised too far. Where Block Jacobi ILU(k) fails in an application, it can be replaced by TPILU(k) in order to maintain good performance, while also achieving full stability. As a further optimization, TPILU(k) offers an optional level-based incomplete inverse method as a fast approximation for the original ILU(k) preconditioned matrix. Although this enhancement is not bit-compatible with classical ILU(k), it is bit-compatible with the output from the single-threaded version of the same algorithm. In experiments on a 16-core computer, the enhanced TPILU(k)-based iterative linear solver performed up to 9 times faster. As we approach an era of many-core computing, the ability to efficiently take advantage of many cores will become ever more important. TPILU(k) also demonstrates good performance on cluster or Grid. For example, the new algorithm achieves 50 times speedup with 80 nodes for general sparse matrices of dimension 160,000 that are diagonally dominant.

النظم الموزعة والتوازية والحوسبة العنقودية

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد