DAMO: Deep Agile Mask Optimization for Full Chip Scale

81 0 0.0 ( 0 )

Download Cite

Added by Guojin Chen

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Guojin Chen - Wanli Chen - Yuzhe Ma

Hardware Architecture

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Continuous scaling of the VLSI system leaves a great challenge on manufacturing and optical proximity correction (OPC) is widely applied in conventional design flow for manufacturability optimization. Traditional techniques conducted OPC by leveraging a lithography model and suffered from prohibitive computational overhead, and mostly focused on optimizing a single clip without addressing how to tackle the full chip. In this paper, we present DAMO, a high performance and scalable deep learning-enabled OPC system for full chip scale. It is an end-to-end mask optimization paradigm which contains a Deep Lithography Simulator (DLS) for lithography modeling and a Deep Mask Generator (DMG) for mask pattern generation. Moreover, a novel layout splitting algorithm customized for DAMO is proposed to handle the full chip OPC problem. Extensive experiments show that DAMO outperforms the state-of-the-art OPC solutions in both academia and industrial commercial toolkit.

rate research

CHIPKIT: An agile, reusable open-source framework for rapid test chip development

336 - Paul Whatmough , Marco Donato , Glenn Ko 2020

The current trend for domain-specific architectures (DSAs) has led to renewed interest in research test chips to demonstrate new specialized hardware. Tape-outs also offer huge pedagogical value garnered from real hands-on exposure to the whole system stack. However, successful tape-outs demand hard-earned experience, and the design process is time consuming and fraught with challenges. Therefore, custom chips have remained the preserve of a small number of research groups, typically focused on circuit design research. This paper describes the CHIPKIT framework. We describe a reusable SoC subsystem which provides basic IO, an on-chip programmable host, memory and peripherals. This subsystem can be readily extended with new IP blocks to generate custom test chips. We also present an agile RTL development flow, including a code generation tool calledVGEN. Finally, we outline best practices for full-chip validation across the entire design cycle.

Hardware Architecture

Agile SoC Development with Open ESP

113 - Paolo Mantovani , Davide Giri , Giuseppe Di Guglielmo 2020

ESP is an open-source research platform for heterogeneous SoC design. The platform combines a modular tile-based architecture with a variety of application-oriented flows for the design and optimization of accelerators. The ESP architecture is highly scalable and strikes a balance between regularity and specialization. The companion methodology raises the level of abstraction to system-level design and enables an automated flow from software and hardware development to full-system prototyping on FPGA. For application developers, ESP offers domain-specific automated solutions to synthesize new accelerators for their software and to map complex workloads onto the SoC architecture. For hardware engineers, ESP offers automated solutions to integrate their accelerator designs into the complete SoC. Conceived as a heterogeneous integration platform and tested through years of teaching at Columbia University, ESP supports the open-source hardware community by providing a flexible platform for agile SoC development.

Hardware Architecture

HASCO: Towards Agile HArdware and Software CO-design for Tensor Computation

442 - Qingcheng Xiao , Size Zheng , Bingzhe Wu 2021

Tensor computations overwhelm traditional general-purpose computing devices due to the large amounts of data and operations of the computations. They call for a holistic solution composed of both hardware acceleration and software mapping. Hardware/software (HW/SW) co-design optimizes the hardware and software in concert and produces high-quality solutions. There are two main challenges in the co-design flow. First, multiple methods exist to partition tensor computation and have different impacts on performance and energy efficiency. Besides, the hardware part must be implemented by the intrinsic functions of spatial accelerators. It is hard for programmers to identify and analyze the partitioning methods manually. Second, the overall design space composed of HW/SW partitioning, hardware optimization, and software optimization is huge. The design space needs to be efficiently explored. To this end, we propose an agile co-design approach HASCO that provides an efficient HW/SW solution to dense tensor computation. We use tensor syntax trees as the unified IR, based on which we develop a two-step approach to identify partitioning methods. For each method, HASCO explores the hardware and software design spaces. We propose different algorithms for the explorations, as they have distinct objectives and evaluation costs. Concretely, we develop a multi-objective Bayesian optimization algorithm to explore hardware optimization. For software optimization, we use heuristic and Q-learning algorithms. Experiments demonstrate that HASCO achieves a 1.25X to 1.44X latency reduction through HW/SW co-design compared with developing the hardware and software separately.

Hardware Architecture Artificial Intelligence

Rainbow: A Composable Coherence Protocol for Multi-Chip Servers

61 - Lucia G. Menezo , Valentin Puente , 2020

The use of multi-chip modules (MCM) and/or multi-socket boards is the most suitable approach to increase the computation density of servers while keep chip yield attained. This paper introduces a new coherence protocol suitable, in terms of complexity and scalability, for this class of systems. The proposal uses two complementary ideas: (1) A mechanism that dissociates complexity from performance by means of colored-token counting, (2) A construct that optimizes performance and cost by means of two functionally symmetrical modules working in the last level cache of each chip (D|F-LLC) and each memory controller (D|F-MEM). Each of these structures is divided into two parts: (2.1) The first one consists of a small loosely inclusive sparse directory where only the most actively shared data are tracked in the chip (D-LLC) from each memory controller (D-MEM) and, (2.2) The second is a d-left Counting Bloom Filter which stores approximate information about the blocks allocated, either inside the chip (F-LLC) or in the home memory controller (F-MEM). The coordinated work of both structures minimizes the coherence-related effects on the average memory latency perceived by the processor. Our proposal is able to improve on the performance of a HyperTransport-like coherence protocol by from 25%-to-60%.

Hardware Architecture

The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture

525 - Andrea Biagioni , Francesca Lo Cicero , Alessandro Lonardo 2012

One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip communications between processors, recent multi-tile (i.e. multi-core) architectures face the challenge for an efficient on-chip interconnection network between processors tiles. In this paper, we present a configurable and scalable architecture, based on our Distributed Network Processor (DNP) IP Library, targeting systems ranging from single MPSoCs to massive HPC platforms. The DNP provides inter-tile services for both on-chip and off-chip communications with a uniform RDMA style API, over a multi-dimensional direct network with a (possibly) hybrid topology.

Hardware Architecture Networking and Internet Architecture