Deep neural network (DNN) based AI applications on the edge require both low-cost computing platforms and high-quality services. However, the limited memory, computing resources, and power budget of edge devices constrain the effectiveness of DNN algorithms, making the development of edge-oriented AI algorithms and implementations (e.g., accelerators) challenging. In this paper, we summarize our recent efforts toward efficient on-device AI development from three aspects, covering both training and inference. First, we present on-device training with ultra-low memory usage. We propose a novel rank-adaptive tensorized neural network model, which offers orders-of-magnitude memory reduction during training. Second, we introduce an ultra-low-bitwidth quantization method for DNN model compression, achieving state-of-the-art accuracy at the same compression ratio. Third, we introduce an ultra-low-latency DNN accelerator design, following a software/hardware co-design methodology. This paper emphasizes the importance and efficacy of training, quantization, and accelerator design, and calls for more research breakthroughs in this area to enable AI on the edge.
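As a rough illustration of the ultra-low-bitwidth compression idea mentioned above, the sketch below applies a generic symmetric uniform quantizer to a weight tensor; the 2-bit setting, layer size, and function name are assumptions for illustration, not the paper's specific quantization method.

```python
import numpy as np

def quantize_uniform(w, bits=2):
    """Generic symmetric uniform quantization of a weight tensor to `bits` bits.

    Weights are mapped to a small set of integer levels and stored together
    with a single floating-point scale per tensor (illustrative only).
    """
    levels = 2 ** (bits - 1) - 1                 # e.g. 1 level magnitude for 2-bit
    scale = np.max(np.abs(w)) / max(levels, 1)
    q = np.clip(np.round(w / scale), -levels, levels).astype(np.int8)
    return q, scale                              # dequantize with q * scale

# Example: a 256x256 FP32 layer (256 KB) shrinks to roughly 16 KB at 2 bits
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_uniform(w, bits=2)
print("levels used:", np.unique(q), "reconstruction MSE:", np.mean((w - q * s) ** 2))
```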
Effective capacity defines the maximum communication rate subject to a specific delay constraint, while effective energy efficiency (EEE) is the ratio between effective capacity and power consumption. We analyze the EEE of ultra-reliable networks operating in the finite blocklength regime. We obtain a closed-form approximation for the EEE in quasi-static Nakagami-$m$ (and, as a sub-case, Rayleigh) fading channels as a function of power, error probability, and latency. Furthermore, we characterize the QoS-constrained EEE maximization problem for different power consumption models, which reveals a significant difference between finite and infinite blocklength coding with respect to EEE and the optimal power allocation strategy. As asserted in the literature, achieving ultra-reliability with a single transmission consumes a huge amount of power, which is not feasible for energy-limited IoT devices. In this context, accounting for the empty buffer probability in machine type communication (MTC) and extending the maximum delay tolerance jointly enhance the EEE and allow for adaptive retransmission of faulty packets. Our analysis reveals that obtaining the optimal error probability for each transmission by minimizing the non-empty buffer probability approaches EEE optimality, while remaining analytically tractable via Dinkelbach's algorithm. Furthermore, the results illustrate the power saving and the significant EEE gain attained by applying adaptive retransmission protocols, at the cost of a limited increase in latency.
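Since the abstract relies on Dinkelbach's algorithm to make the ratio maximization tractable, the sketch below shows the generic Dinkelbach iteration for a problem of the form $\max_p f(p)/g(p)$. The functions `f` and `g` are toy placeholders (a logarithmic rate and an affine power model), not the paper's closed-form EEE for Nakagami-$m$ fading.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f(p):                               # stand-in for effective capacity (concave in p)
    return np.log2(1.0 + 10.0 * p)

def g(p, p_circuit=0.1, eta=0.5):       # illustrative affine power-consumption model
    return p / eta + p_circuit

def dinkelbach(p_max=1.0, tol=1e-8, max_iter=50):
    """Generic Dinkelbach iteration for maximizing the ratio f(p)/g(p)."""
    lam = 0.0
    for _ in range(max_iter):
        # inner problem: maximize f(p) - lam * g(p) over the feasible power range
        res = minimize_scalar(lambda p: -(f(p) - lam * g(p)),
                              bounds=(1e-6, p_max), method="bounded")
        p_star = res.x
        value = f(p_star) - lam * g(p_star)
        lam = f(p_star) / g(p_star)      # update the ratio estimate
        if abs(value) < tol:             # converged when the inner optimum is ~0
            break
    return p_star, lam

p_opt, eee = dinkelbach()
print(f"optimal power ~ {p_opt:.3f}, optimal ratio ~ {eee:.3f} (toy units)")
```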
Various hardware accelerators have been developed for energy-efficient and real-time inference of neural networks on edge devices. However, most training is done on high-performance GPUs or servers, and the huge memory and computing costs prevent training neural networks on edge devices. This paper proposes a novel tensor-based training framework, which offers orders-of-magnitude memory reduction in the training process. We propose a novel rank-adaptive tensorized neural network model, and design a hardware-friendly low-precision algorithm to train this model. We present an FPGA accelerator to demonstrate the benefits of this training method on edge devices. Our preliminary FPGA implementation achieves a $59\times$ speedup and $123\times$ energy reduction compared to an embedded CPU, and a $292\times$ memory reduction over standard full-size training.
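To make the memory-reduction argument concrete, the sketch below stores a toy $512\times 512$ weight matrix in tensor-train (TT) format and counts parameters. The factorization shapes and fixed ranks are illustrative assumptions; the paper's rank-adaptive training adjusts the ranks rather than fixing them.

```python
import numpy as np

# Toy TT representation of a 512x512 weight matrix, factorized as
# (8*8*8) x (8*8*8) with fixed illustrative TT-ranks (1, 4, 4, 1).
m = n = (8, 8, 8)
r = (1, 4, 4, 1)
cores = [np.random.randn(r[k], m[k], n[k], r[k + 1]) * 0.1 for k in range(3)]

def tt_to_full(cores):
    """Contract the TT cores back into the full (512, 512) weight matrix."""
    W = cores[0]                                  # shape (1, m1, n1, r1)
    for G in cores[1:]:
        W = np.einsum('...a,aijb->...ijb', W, G)  # chain the cores along the ranks
    W = W.squeeze(axis=(0, -1))                   # drop the boundary ranks
    W = W.transpose(0, 2, 4, 1, 3, 5)             # (m1,n1,m2,n2,m3,n3) -> (m1,m2,m3,n1,n2,n3)
    return W.reshape(np.prod(m), np.prod(n))

W_full = tt_to_full(cores)
dense_params = np.prod(m) * np.prod(n)
tt_params = sum(G.size for G in cores)
print(f"reconstructed shape: {W_full.shape}, dense params: {dense_params}, "
      f"TT params: {tt_params} ({dense_params / tt_params:.0f}x smaller)")
```

During training, only the small cores (and their gradients and optimizer states) need to live in memory, which is where the orders-of-magnitude reduction comes from.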
Inspired by recent work on extended image volumes that lays the groundwork for randomized probing of extremely large seismic wavefield matrices, we present a memory-frugal and computationally efficient inversion methodology that uses techniques from randomized linear algebra. By means of a carefully selected realistic synthetic example, we demonstrate that we are capable of achieving competitive inversion results at a fraction of the memory cost of conventional full-waveform inversion, with limited computational overhead. By trading memory for a negligible computational overhead, the presented technology opens the door to the use of low-memory accelerators such as GPUs.
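As a generic example of the randomized-probing primitive underlying such methods, the sketch below approximates a large matrix that is only accessible through its action on vectors, using a small number of random probes and a randomized range finder. The rank, probe count, and symmetric test matrix are assumptions for illustration; this is not the paper's inversion workflow.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
U = rng.standard_normal((n, 10))
A = U @ U.T                          # stand-in for a huge, effectively low-rank matrix

def matvec(x):                       # in practice, only A's action is available
    return A @ x

k = 20                               # number of random probing vectors
Omega = rng.standard_normal((n, k))
Y = np.column_stack([matvec(Omega[:, j]) for j in range(k)])
Q, _ = np.linalg.qr(Y)               # orthonormal basis for the sampled range
B = Q.T @ A                          # small k x n factor (adjoint matvecs in a matrix-free setting)

err = np.linalg.norm(A - Q @ B) / np.linalg.norm(A)
print(f"relative approximation error with {k} probes: {err:.2e}")
```

Only the $n \times k$ and $k \times n$ factors need to be stored, rather than the full $n \times n$ matrix, which is the kind of memory-for-computation trade the abstract describes.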
Traditional link adaptation (LA) schemes in cellular networks must be revised for networks beyond the fifth generation (b5G) to guarantee the strict latency and reliability requirements advocated by ultra-reliable low-latency communications (URLLC). In particular, poor error rate prediction potentially increases retransmissions, which in turn increase latency and reduce reliability. In this paper, we present an interference prediction method to enhance LA for URLLC. To develop our prediction method, we propose a kernel-based probability density estimation algorithm and provide an in-depth analysis of its statistical performance. We also provide a low-complexity version, suitable for practical scenarios. The proposed scheme is compared with state-of-the-art LA solutions over fully compliant 3rd Generation Partnership Project (3GPP) calibrated channels, demonstrating the validity of our proposal.
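To illustrate the kernel-based density estimation idea, the sketch below fits a Gaussian KDE to past interference samples and reads off the level exceeded only with a small target probability, which an LA scheme could use as a back-off margin. The lognormal samples, the $10^{-3}$ exceedance target, and the grid-based CDF inversion are illustrative assumptions, not the paper's estimator or its low-complexity variant.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# synthetic past interference measurements in dB (illustrative only)
interference_db = 10 * np.log10(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

kde = gaussian_kde(interference_db)          # Gaussian kernels, rule-of-thumb bandwidth

# Interference level exceeded only with probability epsilon, obtained by
# numerically inverting the estimated CDF on a grid.
grid = np.linspace(interference_db.min() - 5, interference_db.max() + 5, 4000)
cdf = np.cumsum(kde(grid)) * (grid[1] - grid[0])
epsilon = 1e-3
idx = min(np.searchsorted(cdf, 1 - epsilon), grid.size - 1)
print(f"interference margin for {epsilon:.0e} exceedance: {grid[idx]:.1f} dB")
```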
Considering a Manhattan mobility model in vehicle-to-vehicle networks, this work studies a power minimization problem subject to second-order statistical constraints on latency and reliability, captured by a network-wide maximal data queue length. We invoke results in extreme value theory to characterize statistics of extreme events in terms of the maximal queue length. Subsequently, leveraging Lyapunov stochastic optimization to deal with network dynamics, we propose two queue-aware power allocation solutions. In contrast with the baseline, our approaches achieve lower mean and variance of the maximal queue length.
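A standard extreme value theory tool for characterizing such maximal-queue-length statistics is the peaks-over-threshold method, sketched below: queue-length samples exceeding a high threshold are fitted with a generalized Pareto distribution (GPD), whose shape and scale summarize the tail. The synthetic queue trace, the 99th-percentile threshold, and the query point are illustrative assumptions, not the paper's V2V setup or its Lyapunov-based power allocation.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(2)
queue_len = rng.exponential(scale=20.0, size=100_000)    # stand-in queue-length samples

threshold = np.quantile(queue_len, 0.99)                 # high queue-length threshold
excess = queue_len[queue_len > threshold] - threshold

# Fit the GPD to the exceedances (location fixed at zero, as usual for excesses)
shape, loc, scale = genpareto.fit(excess, floc=0.0)
print(f"threshold: {threshold:.1f}, GPD shape: {shape:.3f}, scale: {scale:.1f}")

# Tail estimate: P(Q > q) ~ P(Q > u) * GPD_sf(q - u)
q = threshold + 3 * scale
p_exceed = (excess.size / queue_len.size) * genpareto.sf(q - threshold, shape,
                                                         loc=0.0, scale=scale)
print(f"estimated P(queue > {q:.0f}) ~ {p_exceed:.2e}")
```

The fitted shape and scale parameters are exactly the second-order statistics the abstract refers to when constraining extreme queue events.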