
Communication-Computation Trade-Off in Resource-Constrained Edge Inference

Published by: Jiawei Shao
Publication date: 2020
Paper language: English

The recent breakthrough in artificial intelligence (AI), especially deep neural networks (DNNs), has affected every branch of science and technology. In particular, edge AI has been envisioned as a major application scenario to provide DNN-based services at edge devices. This article presents effective methods for edge inference at resource-constrained devices. It focuses on device-edge co-inference, assisted by an edge computing server, and investigates a critical trade-off between the computation cost of the on-device model and the communication cost of forwarding the intermediate feature to the edge server. A three-step framework is proposed for effective inference: (1) model split point selection to determine the on-device model, (2) communication-aware model compression to reduce the on-device computation and the resulting communication overhead simultaneously, and (3) task-oriented encoding of the intermediate feature to further reduce the communication overhead. Experiments demonstrate that the proposed framework achieves a better trade-off and significantly reduces the inference latency compared with baseline methods.
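To make the three-step pipeline concrete, here is a minimal sketch of device-edge co-inference in PyTorch. The toy CNN, the split index, and the 8-bit quantiser standing in for task-oriented feature encoding are all illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """Toy network partitioned at a chosen split point (step 1)."""
    def __init__(self, split_idx: int):
        super().__init__()
        layers = [
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
        ]
        self.device_part = nn.Sequential(*layers[:split_idx])  # runs on the device
        self.server_part = nn.Sequential(*layers[split_idx:])  # runs on the edge server

def quantize_feature(x: torch.Tensor):
    # Crude 8-bit quantisation as a stand-in for task-oriented encoding (step 3).
    scale = x.abs().max() / 127.0 + 1e-8
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale  # q is the "uplink" payload: one byte per element

model = SplitModel(split_idx=4)  # earlier split: less on-device compute, larger feature
feature = model.device_part(torch.randn(1, 3, 32, 32))
q, scale = quantize_feature(feature)
logits = model.server_part(q.float() * scale)  # server-side completion
print(q.numel(), "uplink bytes ->", tuple(logits.shape))

Varying split_idx moves cost between the device (computation) and the channel (communication); step 2 of the framework would additionally prune the device-side layers.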


Read also

Device-edge co-inference, which partitions a deep neural network between a resource-constrained mobile device and an edge server, has recently emerged as a promising paradigm to support intelligent mobile applications. To accelerate the inference process, on-device model sparsification and intermediate feature compression are regarded as two prominent techniques. However, as the on-device model sparsity level and the intermediate feature compression ratio have direct impacts on the computation workload and the communication overhead respectively, and both of them affect the inference accuracy, finding the optimal values of these hyper-parameters poses a major challenge due to the large search space. In this paper, we endeavor to develop an efficient algorithm to determine these hyper-parameters. By selecting a suitable model split point and a pair of encoder/decoder for the intermediate feature vector, this problem is cast as a sequential decision problem, for which a novel automated machine learning (AutoML) framework is proposed based on deep reinforcement learning (DRL). Experimental results on an image classification task demonstrate the effectiveness of the proposed framework in achieving a better communication-computation trade-off and significant inference speedup against various baseline schemes.
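To illustrate the joint search space this abstract describes, the sketch below scores (split point, sparsity level, compression ratio) triples with a plain grid search; the paper replaces this with a DRL agent. The candidate grids and the accuracy/latency models are illustrative assumptions only.

# Hypothetical stand-in for the hyper-parameter search: grid search here,
# a DRL agent in the paper. All constants are assumptions.
split_points = [2, 4, 6, 8]          # where to cut the network
sparsity_levels = [0.3, 0.5, 0.7]    # on-device pruning ratio
compression_ratios = [4, 8, 16]      # feature encoder compression factor

def reward(split, sparsity, ratio):
    # Placeholder models: a real run would measure accuracy and latency.
    accuracy = 0.90 - 0.005 * ratio - 0.03 * sparsity
    latency = split * (1.0 - sparsity) + 100.0 / ratio
    return accuracy - 0.001 * latency

best = max(
    ((s, p, r) for s in split_points
               for p in sparsity_levels
               for r in compression_ratios),
    key=lambda cfg: reward(*cfg),
)
print("best (split point, sparsity, compression ratio):", best)

Even this toy grid has 4 x 3 x 3 = 36 configurations; with realistic per-layer choices the space grows combinatorially, which motivates the learned search.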
Gradient coding allows a master node to derive the aggregate of the partial gradients, calculated by some worker nodes over the local data sets, with minimum communication cost and in the presence of stragglers. In this paper, for gradient coding with linear encoding, we characterize the optimum communication cost for heterogeneous distributed systems with \emph{arbitrary} data placement, with $s \in \mathbb{N}$ stragglers and $a \in \mathbb{N}$ adversarial nodes. In particular, we show that the optimum communication cost, normalized by the size of the gradient vectors, is equal to $(r-s-2a)^{-1}$, where $r \in \mathbb{N}$ is the minimum number of times a data partition is replicated. In other words, the communication cost is determined by the data partition with the minimum replication, irrespective of the structure of the placement. The proposed achievable scheme also allows us to target the computation of a polynomial function of the aggregated gradient matrix. It also allows us to borrow some ideas from approximate computing and propose an approximate gradient coding scheme for the cases when the repetition in data placement is smaller than what is needed to meet the restriction imposed on communication cost, or when the number of stragglers appears to be more than the presumed value in the system design.
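A quick numeric instance of the stated bound may help; the values of $r$, $s$ and $a$ below are chosen purely for illustration.

% Worked instance of the normalized optimum communication cost (r - s - 2a)^{-1}.
% Assumed values: minimum replication r = 4, s = 1 straggler, a = 1 adversary.
\[
  (r - s - 2a)^{-1} = (4 - 1 - 2 \cdot 1)^{-1} = 1,
\]
% i.e., each surviving worker must send a message as large as a full gradient.
% Without adversaries (a = 0), the cost drops to
\[
  (4 - 1 - 0)^{-1} = \tfrac{1}{3},
\]
% so each worker transmits only a third of a gradient-sized vector.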
Consider a device that is connected to an edge processor via a communication channel. The device holds local data that is to be offloaded to the edge processor so as to train a machine learning model, e.g., for regression or classification. Transmission of the data to the learning processor, as well as training based on Stochastic Gradient Descent (SGD), must both be completed within a time limit. Assuming that communication and computation can be pipelined, this letter investigates the optimal choice for the packet payload size, given the overhead of each data packet transmission and the ratio between the computation and the communication rates. This amounts to a trade-off between bias and variance, since communicating the entire data set first reduces the bias of the training process but it may not leave sufficient time for learning. Analytical bounds on the expected optimality gap are derived so as to enable an effective optimization, which is validated in numerical results.
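The sketch below explores that trade-off numerically under a crude model: larger packets amortize the per-packet overhead and deliver more samples, but delay the first usable data and shrink the training window. The overhead, rates, and loss proxies are assumptions; the letter itself derives analytical bounds rather than simulating.

# Hypothetical exploration of the payload-size trade-off (all constants assumed).
OVERHEAD_S = 2e-3       # fixed per-packet overhead, seconds
COMM_RATE = 1e6         # channel rate, bytes/second
COMPUTE_RATE = 2e3      # SGD samples processed per second
DEADLINE_S = 1.0        # joint communication + training deadline
SAMPLE_BYTES = 100      # size of one training sample

def proxy_loss(payload_bytes: int) -> float:
    t_packet = OVERHEAD_S + payload_bytes / COMM_RATE
    n_samples = int(DEADLINE_S / t_packet) * (payload_bytes // SAMPLE_BYTES)
    # Pipelined training cannot start before the first packet lands (assumption).
    n_steps = min(int(COMPUTE_RATE * (DEADLINE_S - t_packet)), n_samples)
    bias = 1.0 / max(n_samples, 1)       # more delivered data -> lower bias
    variance = 1.0 / max(n_steps, 1)     # more SGD steps -> lower variance
    return bias + variance

best = min(range(SAMPLE_BYTES, 100_001, SAMPLE_BYTES), key=proxy_loss)
print("best payload size (bytes):", best)

Under these made-up constants the sweep settles at an interior payload size, mirroring the bias/variance tension the letter analyzes.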
Tian Huang, Tao Luo, Ming Yan (2021)
Neural network training on edge terminals is essential for edge AI computing, which needs to adapt to an evolving environment. Quantised models can run efficiently on edge devices, but existing training methods for these compact models are designed to run on powerful servers with abundant memory and energy budgets. For example, the quantisation-aware training (QAT) method involves two copies of the model parameters, which is usually beyond the capacity of on-chip memory in edge devices. Data movement between off-chip and on-chip memory is energy-demanding as well. These resource requirements are trivial for powerful servers but critical for edge devices. To mitigate these issues, we propose Resource Constrained Training (RCT). RCT keeps only a quantised model throughout training, so that the memory requirements for model parameters during training are reduced. It adjusts the per-layer bitwidth dynamically in order to save energy when a model can learn effectively with lower precision. We carry out experiments with representative models and tasks in image applications and natural language processing. Experiments show that RCT saves more than 86% of the energy for General Matrix Multiply (GEMM) and more than 46% of the memory for model parameters, with limited accuracy loss. Compared with the QAT-based method, RCT saves about half of the energy spent on moving model parameters.
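A minimal sketch of the single-copy, dynamic-bitwidth idea follows. The uniform quantiser and the bit-lowering rule are illustrative assumptions; RCT's actual adjustment criterion is more elaborate.

# Hypothetical sketch: one quantised copy of the weights, per-layer bitwidth
# lowered when the layer appears to learn fine at reduced precision.
import torch

def quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Uniform symmetric quantisation of a weight tensor to `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax + 1e-12
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

class LayerState:
    # Unlike QAT, no full-precision shadow copy is kept (the point of RCT).
    def __init__(self, shape, bits=8):
        self.bits = bits
        self.w = quantize(torch.randn(shape), bits)

    def maybe_lower_precision(self, loss_improving: bool):
        # Assumed rule: while the loss keeps improving, try fewer bits
        # to cut memory traffic and GEMM energy.
        if loss_improving and self.bits > 2:
            self.bits -= 1
            self.w = quantize(self.w, self.bits)

layer = LayerState((64, 64))
layer.maybe_lower_precision(loss_improving=True)
print("bitwidth after adjustment:", layer.bits)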
In this article, we consider the problem of relay assisted computation offloading (RACO), in which user A aims to share the results of computational tasks with another user B through wireless exchange over a relay platform equipped with mobile edge computing capabilities, referred to as a mobile edge relay server (MERS). To support the computation offloading, we propose a hybrid relaying (HR) approach employing two orthogonal frequency bands, where the amplify-and-forward scheme is used in one band to exchange computational results, while the decode-and-forward scheme is used in the other band to transfer the unprocessed tasks. The motivation behind the proposed HR scheme for RACO is to adapt the allocation of computing and communication resources both to dynamic user requirements and to diverse computational tasks. Within this framework, we seek to minimize the weighted sum of the execution delay and the energy consumption in the RACO system by jointly optimizing the computation offloading ratio, the bandwidth allocation, the processor speeds, as well as the transmit power levels of both user A and the MERS, under practical constraints on the available computing and communication resources. The resultant problem is formulated as a non-differentiable and nonconvex optimization program with highly coupled constraints. By adopting a series of transformations and introducing auxiliary variables, we first convert this problem into a more tractable yet equivalent form. We then develop an efficient iterative algorithm for its solution based on the concave-convex procedure. By exploiting the special structure of this problem, we also propose a simplified algorithm based on the inexact block coordinate descent method, with reduced computational complexity. Finally, we present numerical results that illustrate the advantages of the proposed algorithms over state-of-the-art benchmark schemes.
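To give a feel for the weighted delay/energy objective, the toy below optimises only the offloading ratio with a one-dimensional sweep; the paper jointly optimises bandwidth, processor speeds and transmit powers via the concave-convex procedure and block coordinate descent. Every constant here is an assumption.

# Hypothetical one-variable slice of the RACO objective (all constants assumed).
W_DELAY, W_ENERGY = 0.7, 0.3         # weights in the objective
LOCAL_RATE, RELAY_RATE = 1e9, 4e9    # CPU cycles/s at user A and at the MERS
TASK_CYCLES, TASK_BITS = 2e9, 1e7    # task workload and its input size
TX_POWER, LINK_RATE = 0.5, 2e7       # transmit power (W) and link rate (bit/s)
CPU_J_PER_CYCLE = 1e-9               # assumed energy per local CPU cycle

def cost(rho: float) -> float:
    # rho in [0, 1] is the fraction of the task offloaded to the MERS.
    t_local = (1 - rho) * TASK_CYCLES / LOCAL_RATE
    t_tx = rho * TASK_BITS / LINK_RATE
    t_remote = rho * TASK_CYCLES / RELAY_RATE
    delay = max(t_local, t_tx + t_remote)   # local and remote parts overlap
    energy = TX_POWER * t_tx + CPU_J_PER_CYCLE * (1 - rho) * TASK_CYCLES
    return W_DELAY * delay + W_ENERGY * energy

rho_best = min((i / 100 for i in range(101)), key=cost)
print("best offloading ratio:", rho_best)

Under these made-up numbers the sweep lands at an interior ratio, reflecting the delay/energy tension the full formulation balances across many more variables.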
