
Mixed Dimension Embeddings with Application to Memory-Efficient Recommendation Systems

Posted by Antonio Ginart
Publication date: 2019
Paper language: English





Embedding representations power machine intelligence in many applications, including recommendation systems, but they are space intensive -- potentially occupying hundreds of gigabytes in large-scale settings. To help manage this outsized memory consumption, we explore mixed dimension embeddings, an embedding layer architecture in which a particular embedding vector's dimension scales with its query frequency. Through theoretical analysis and systematic experiments, we demonstrate that using mixed dimensions can drastically reduce memory usage while maintaining and even improving ML performance. Empirically, we show that the proposed mixed dimension layers improve accuracy by 0.1% while using half as many parameters, or maintain it while using 16X fewer parameters, for the click-through rate prediction task on the Criteo Kaggle dataset.
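As a rough illustration of the idea, the PyTorch sketch below groups categories into frequency-ordered blocks, gives each block its own (typically smaller) table dimension, and projects every block back to a common base dimension so downstream layers see a uniform width. The class name, block sizes, and the linear-projection combiner are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class MixedDimEmbedding(nn.Module):
    """Sketch of a mixed dimension embedding layer.

    Frequent categories live in a wide table, rare categories in a narrow
    one; each block is projected to a common base dimension.
    """

    def __init__(self, block_sizes, block_dims, base_dim):
        super().__init__()
        self.base_dim = base_dim
        # Global index boundaries of each frequency block.
        self.register_buffer(
            "offsets", torch.tensor([0] + list(block_sizes)).cumsum(0)
        )
        self.tables = nn.ModuleList(
            nn.Embedding(n, d) for n, d in zip(block_sizes, block_dims)
        )
        self.projections = nn.ModuleList(
            nn.Linear(d, base_dim, bias=False) for d in block_dims
        )

    def forward(self, ids):
        # ids: LongTensor of shape (batch,) holding global category indices.
        out = torch.zeros(ids.shape[0], self.base_dim, device=ids.device)
        for k, (table, proj) in enumerate(zip(self.tables, self.projections)):
            mask = (ids >= self.offsets[k]) & (ids < self.offsets[k + 1])
            if mask.any():
                local = ids[mask] - self.offsets[k]
                out[mask] = proj(table(local))
        return out

# Example: 1k frequent ids at dim 32, 100k rare ids at dim 4, base dim 32.
layer = MixedDimEmbedding([1_000, 100_000], [32, 4], base_dim=32)
vecs = layer(torch.randint(0, 101_000, (8,)))
print(vecs.shape)  # torch.Size([8, 32])
```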


Read also

Modern deep learning-based recommendation systems exploit hundreds to thousands of different categorical features, each with millions of different categories, ranging from clicks to posts. To respect the natural diversity within the categorical data, embeddings map each category to a unique dense representation within an embedded space. Since each categorical feature could take on as many as tens of millions of different possible categories, the embedding tables form the primary memory bottleneck during both training and inference. We propose a novel approach for reducing the embedding size in an end-to-end fashion by exploiting complementary partitions of the category set to produce a unique embedding vector for each category without explicit definition. By storing multiple smaller embedding tables based on each complementary partition and combining embeddings from each table, we define a unique embedding for each category at smaller memory cost. This approach may be interpreted as using a specific fixed codebook to ensure uniqueness of each category's representation. Our experimental results demonstrate the effectiveness of our approach over the hashing trick for reducing the size of the embedding tables, in terms of model loss and accuracy, while retaining a similar reduction in the number of parameters.
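A minimal sketch of the complementary-partition idea, using the quotient/remainder pair of partitions and an element-wise product as the combiner. Both choices, and the class name, are illustrative assumptions rather than the paper's exact construction.

```python
import torch
import torch.nn as nn

class QRCompositionalEmbedding(nn.Module):
    """Sketch of a compositional embedding from two complementary partitions.

    Each category id is split into (id // num_buckets, id % num_buckets);
    the two smaller tables are combined element-wise, so every id gets a
    distinct vector while storing roughly 2*sqrt(N) rows instead of N.
    """

    def __init__(self, num_categories, num_buckets, dim):
        super().__init__()
        self.num_buckets = num_buckets
        quotient_rows = (num_categories + num_buckets - 1) // num_buckets
        self.quotient = nn.Embedding(quotient_rows, dim)
        self.remainder = nn.Embedding(num_buckets, dim)

    def forward(self, ids):
        # Element-wise product of the two partition embeddings.
        return self.quotient(ids // self.num_buckets) * self.remainder(ids % self.num_buckets)

# 10M categories stored with roughly 2 x 3163 rows instead of 10M.
emb = QRCompositionalEmbedding(10_000_000, 3163, dim=16)
print(emb(torch.tensor([0, 9_999_999])).shape)  # torch.Size([2, 16])
```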
We give a fast oblivious L2-embedding of $A \in \mathbb{R}^{n \times d}$ to $B \in \mathbb{R}^{r \times d}$ satisfying $(1-\varepsilon)\|Ax\|_2^2 \le \|Bx\|_2^2 \le (1+\varepsilon)\|Ax\|_2^2$. Our embedding dimension $r$ equals $d$, a constant independent of the distortion $\varepsilon$. We use as a black box any L2-embedding $\Pi^T A$ and inherit its runtime and accuracy, effectively decoupling the dimension $r$ from runtime and accuracy, allowing downstream machine learning applications to benefit from both a low dimension and high accuracy (in prior embeddings, higher accuracy means higher dimension). We give applications of our L2-embedding to regression, PCA, and statistical leverage scores. We also give applications to L1: 1.) an oblivious L1-embedding with dimension $d + O(d \ln^{1+\eta} d)$ and distortion $O((d \ln d)/\ln\ln d)$, with application to constructing well-conditioned bases; 2.) fast approximation of L1-Lewis weights using our L2-embedding to quickly approximate L2-leverage scores.
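For intuition about what the oblivious L2-embedding guarantee means, here is a plain Gaussian sketch in NumPy. It is not the paper's dimension-$d$ construction (a Gaussian sketch needs $r$ on the order of $d/\varepsilon^2$, whereas the paper achieves $r = d$), and the matrix sizes are arbitrary assumptions.

```python
import numpy as np

# A plain Gaussian sketch: oblivious because S is drawn independently of A.
rng = np.random.default_rng(0)
n, d, r = 5_000, 20, 800
A = rng.standard_normal((n, d))
S = rng.standard_normal((r, n)) / np.sqrt(r)
B = S @ A

# The squared norm of A x is approximately preserved for every x.
x = rng.standard_normal(d)
ratio = np.linalg.norm(B @ x) ** 2 / np.linalg.norm(A @ x) ** 2
print(f"||Bx||^2 / ||Ax||^2 = {ratio:.3f}")  # close to 1
```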
The movement of large quantities of data during the training of a Deep Neural Network presents immense challenges for machine learning workloads. To minimize this overhead, especially in the movement and calculation of gradient information, we introduce streaming batch principal component analysis as an update algorithm. Streaming batch principal component analysis uses stochastic power iterations to generate a stochastic k-rank approximation of the network gradient. We demonstrate that the low-rank updates produced by streaming batch principal component analysis can effectively train convolutional neural networks on a variety of common datasets, with performance comparable to standard mini-batch gradient descent. These results can lead to improvements both in the design of application-specific integrated circuits for deep learning and in the speed of synchronization of machine learning models trained with data parallelism.
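A rough NumPy sketch of the underlying mechanism: an orthonormal basis of the top-k gradient subspace is refined with one power step per incoming gradient matrix, and the low-rank update is the projection of the gradient onto that basis. The function name and update schedule are assumptions, not the authors' exact algorithm.

```python
import numpy as np

def streaming_rank_k(grad_batches, k, rng=None):
    """Maintain a rank-k basis Q of the gradient row space via power steps."""
    rng = rng or np.random.default_rng(0)
    Q = None
    for G in grad_batches:                   # G: (rows, cols) gradient matrix
        if Q is None:                        # random orthonormal start
            Q, _ = np.linalg.qr(rng.standard_normal((G.shape[1], k)))
        Q, _ = np.linalg.qr(G.T @ (G @ Q))   # one stochastic power step
    return Q                                 # low-rank update: (G @ Q) @ Q.T

# Synthetic near-low-rank gradients: the basis captures most of their energy.
rng = np.random.default_rng(1)
low_rank = rng.standard_normal((64, 3)) @ rng.standard_normal((3, 32))
batches = [low_rank + 0.01 * rng.standard_normal((64, 32)) for _ in range(10)]
Q = streaming_rank_k(batches, k=3)
G = batches[-1]
print(np.linalg.norm(G - (G @ Q) @ Q.T) / np.linalg.norm(G))  # small residual
```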
Training convolutional neural network models is memory intensive, since back-propagation requires storing the activations of all intermediate layers. This presents a practical concern when seeking to deploy very deep architectures in production, especially when models need to be frequently re-trained on updated datasets. In this paper, we propose a new implementation of back-propagation that significantly reduces memory usage by enabling the use of approximations with negligible computational cost and minimal effect on training performance. The algorithm reuses common buffers to temporarily store full activations and compute the forward pass exactly. It also stores approximate per-layer copies of activations, at significant memory savings, that are used in the backward pass. Compared to simply approximating activations within standard back-propagation, our method limits the accumulation of errors across layers. This allows the use of much lower-precision approximations without affecting training accuracy. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method yields performance close to exact training while storing activations compactly with as low as 4-bit precision.
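To make the activation-compression idea concrete, here is a minimal PyTorch autograd sketch for a ReLU layer: the forward pass is exact, but only a 4-bit uniform quantization of the activation is stored for the backward pass. The quantizer and class name are illustrative assumptions, and no buffer reuse or bit packing is shown.

```python
import torch

class QuantizedSaveReLU(torch.autograd.Function):
    """Exact forward ReLU; store only 4-bit codes for the backward pass."""

    @staticmethod
    def forward(ctx, x):
        y = x.clamp(min=0)                           # exact forward pass
        scale = y.max() / 15 + 1e-12                 # 4 bits -> 16 levels
        q = torch.round(y / scale).to(torch.uint8)   # compact stored codes
        ctx.save_for_backward(q)
        ctx.scale = scale
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (q,) = ctx.saved_tensors
        y_approx = q.float() * ctx.scale             # dequantize for backward
        return grad_out * (y_approx > 0).float()     # approximate ReLU mask

x = torch.randn(4, 8, requires_grad=True)
QuantizedSaveReLU.apply(x).sum().backward()
print(x.grad.shape)  # torch.Size([4, 8])
```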
Deep learning-based models are utilized to achieve state-of-the-art performance for recommendation systems. A key challenge for these models is to work with millions of categorical classes or tokens. The standard approach is to learn end-to-end, dense latent representations or embeddings for each token. The resulting embeddings require large amounts of memory that blow up with the number of tokens. Training and inference with these models create storage and memory-bandwidth bottlenecks, leading to significant computing and energy consumption when deployed in practice. To this end, we present the problem of Memory Allocation under budget for embeddings and propose a novel formulation of memory-shared embedding, where memory is shared in proportion to the overlap in semantic information. Our formulation admits a practical and efficient randomized solution with Locality-Sensitive-Hashing-based Memory Allocation (LMA). We demonstrate a significant reduction in the memory footprint while maintaining performance. In particular, our LMA embeddings achieve the same performance as standard embeddings with a 16X reduction in memory footprint. Moreover, LMA achieves an average improvement of over 0.003 AUC across different memory regimes compared to standard DLRM models on the Criteo and Avazu datasets.
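A loosely related sketch of memory sharing for embeddings: every (token, coordinate) pair is hashed into one shared weight pool, so memory no longer grows with a private row per token. This hashed allocation is an assumption for illustration only and does not reproduce LMA's locality-sensitive, overlap-proportional allocation.

```python
import torch
import torch.nn as nn

class HashedSharedEmbedding(nn.Module):
    """Embedding whose weights live in one shared parameter pool.

    A fixed random map sends each (token, coordinate) to a pool slot; a real
    implementation would compute this hash on the fly rather than store it.
    """

    def __init__(self, num_tokens, dim, pool_size, seed=0):
        super().__init__()
        self.pool = nn.Parameter(torch.randn(pool_size) * 0.01)
        g = torch.Generator().manual_seed(seed)
        self.register_buffer(
            "slots", torch.randint(pool_size, (num_tokens, dim), generator=g)
        )

    def forward(self, ids):
        return self.pool[self.slots[ids]]        # gather: (batch, dim)

# 100k tokens x 16 dims backed by only 20k shared weights.
emb = HashedSharedEmbedding(num_tokens=100_000, dim=16, pool_size=20_000)
print(emb(torch.tensor([3, 99_999])).shape)      # torch.Size([2, 16])
```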

