New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Hierarchical Memory Management for Mutable State

175 0 0.0 ( 0 )

Download Cite

Added by Adrien Guatto

Publication date 2018

fields Informatics Engineering

and research's language is English

Authors Adrien Guatto - Sam Westrick - Ram Raghunathan

Programming Languages Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

It is well known that modern functional programming languages are naturally amenable to parallel programming. Achieving efficient parallelism using functional languages, however, remains difficult. Perhaps the most important reason for this is their lack of support for efficient in-place updates, i.e., mutation, which is important for the implementation of both parallel algorithms and the run-time system services (e.g., schedulers and synchronization primitives) used to execute them. In this paper, we propose techniques for efficient mutation in parallel functional languages. To this end, we couple the memory manager with the thread scheduler to make reading and updating data allocated by nested threads efficient. We describe the key algorithms behind our technique, implement them in the MLton Standard ML compiler, and present an empirical evaluation. Our experiments show that the approach performs well, significantly improving efficiency over existing functional language implementations.

rate research

Native Implementation of Mutable Value Semantics

151 - Dimitri Racordon , Denys Shabalin , Daniel Zheng 2021

Unrestricted mutation of shared state is a source of many well-known problems. The predominant safe solutions are pure functional programming, which bans mutation outright, and flow sensitive type systems, which depend on sophisticated typing rules. Mutable value semantics is a third approach that bans sharing instead of mutation, thereby supporting part-wise in-place mutation and local reasoning, while maintaining a simple type system. In the purest form of mutable value semantics, references are second-class: they are only created implicitly, at function boundaries, and cannot be stored in variables or object fields. Hence, variables can never share mutable state. Because references are often regarded as an indispensable tool to write efficient programs, it is legitimate to wonder whether such a discipline can compete other approaches. As a basis for answering that question, we demonstrate how a language featuring mutable value semantics can be compiled to efficient native code. This approach relies on stack allocation for static garbage collection and leverages runtime knowledge to sidestep unnecessary copies.

Programming Languages

PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management

91 - Jiarui Fang , Yang Yu , Shenggui Li 2021

The pre-trained model (PTM) is revolutionizing Artificial intelligence (AI) technology. It learns a model with general language features on the vast text and then fine-tunes the model using a task-specific dataset. Unfortunately, PTM training requires prohibitively expensive computing devices, especially fine-tuning, which is still a game for a small proportion of people in the AI community. Enabling PTMs training on low-quality devices, PatrickStar now makes PTM accessible to everyone. PatrickStar reduces memory requirements of computing platforms by using the CPU-GPU heterogeneous memory space to store model data, consisting of parameters, gradients, and optimizer states. We observe that the GPU memory available for model data changes regularly, in a tide-like pattern, decreasing and increasing iteratively. However, the existing heterogeneous training works do not take advantage of this pattern. Instead, they statically partition the model data among CPU and GPU, leading to both memory waste and memory abuse. In contrast, PatrickStar manages model data in chunks, which are dynamically distributed in heterogeneous memory spaces. Chunks consist of stateful tensors which run as finite state machines during training. Guided by the runtime memory statistics collected in a warm-up iteration, chunks are orchestrated efficiently in heterogeneous memory and generate lower CPU-GPU data transmission volume. Symbiosis with the Zero Redundancy Optimizer, PatrickStar scales to multiple GPUs using data parallelism, with the lowest communication bandwidth requirements and more efficient bandwidth utilization. Experimental results show PatrickStar trains a 12 billion parameters GPT model, 2x larger than the STOA work, on an 8-V100 and 240GB CPU memory node, and is also more efficient on the same model size.

Machine Learning Distributed Parallel and Cluster Computing

Hardware Memory Management for Future Mobile Hybrid Memory Systems

83 - Fei Wen , Mian Qin , Paul Gratz 2020

The current mobile applications have rapidly growing memory footprints, posing a great challenge for memory system design. Insufficient DRAM main memory will incur frequent data swaps between memory and storage, a process that hurts performance, consumes energy and deteriorates the write endurance of typical flash storage devices. Alternately, a larger DRAM has higher leakage power and drains the battery faster. Further, DRAM scaling trends make further growth of DRAMin the mobile space prohibitive due to cost. Emerging non-volatile memory (NVM) has the potential to alleviate these issues due to its higher capacity per cost than DRAM and mini-mal static power. Recently, a wide spectrum of NVM technologies, including phase-change memories (PCM), memristor, and 3D XPoint have emerged. Despite the mentioned advantages, NVM has longer access latency compared to DRAMand NVM writes can incur higher latencies and wear costs. Therefore integration of these new memory technologies in the memory hierarchy requires a fundamental rearchitect-ing of traditional system designs. In this work, we propose a hardware-accelerated memory manager (HMMU) that addresses both types of memory in a flat space address space. We design a set of data placement and data migration policies within this memory manager, such that we may exploit the advantages of each memory technology. By augmenting the system with this HMMU, we reduce the overall memory latency while also reducing writes to the NVM. Experimental results show that our design achieves a 39% reduction in energy consumption with only a 12% performance degradation versus an all-DRAM baseline that is likely untenable in the future.

Hardware Architecture Operating Systems

Continual Distributed Learning for Crisis Management

58 - Aman Priyanshu , Mudit Sinha , Shreyans Mehta 2021

Social media platforms such as Twitter, Facebook etc can be utilised as an important source of information during disaster events. This information can be used for disaster response and crisis management if processed accurately and quickly. However, the data present in such situations is ever-changing, and using considerable resources during such a crisis is not feasible. Therefore, we have to develop a low resource and continually learning system that incorporates text classification models which are robust against noisy and unordered data. We utilised Distributed learning which enabled us to learn on resource-constrained devices, then to alleviate catastrophic forgetting in our target neural networks we utilized regularization. We then applied federated averaging for distributed learning and to aggregate the central model for continual learning.

Machine Learning Distributed Parallel and Cluster Computing

Multiparty Compatibility for Concurrent Objects

137 - Roly Perera 2016

Objects and actors are communicating state machines, offering and consuming different services at different points in their lifecycle. Two complementary challenges arise when programming such systems. When objects interact, their state machines must be compatible, so that services are requested only when they are available. Dually, when objects refine other objects, their state machines must be compliant, so that services are honoured whenever they are promised. In this paper we show how the idea of multiparty compatibility from the session types literature can be applied to both of these problems. We present an untyped language in which concurrent objects are checked automatically for compatibility and compliance. For simple objects, checking can be exhaustive and has the feel of a type system. More complex objects can be partially validated via test cases, leading to a methodology closer to continuous testing. Our proof-of-concept implementation is limited in some important respects, but demonstrates the potential value of the approach and the relationship to existing software development practices.

Programming Languages Distributed Parallel and Cluster Computing Software Engineering

comments

Fetching comments

Damascus University

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Hierarchical Memory Management for Mutable State

Ask ChatGPT about the research

No Arabic abstract

Read More