A common technique for compressing a neural network is to compute the $k$-rank $\ell_2$ approximation $A_{k,2}$ of the matrix $A\in\mathbb{R}^{n\times d}$ that corresponds to a fully connected layer (or embedding layer). Here, $d$ is the number of neurons in the layer, $n$ is the number in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory instead of $O(nd)$. This $\ell_2$-approximation minimizes the sum, over every entry of the matrix $A - A_{k,2}$, of the entry's absolute value raised to the power $p=2$, among every matrix $A_{k,2}\in\mathbb{R}^{n\times d}$ whose rank is $k$. While it can be computed efficiently via SVD, the $\ell_2$-approximation is known to be very sensitive to outliers (far-away rows). Hence, machine learning uses, e.g., Lasso regression, $\ell_1$-regularization, and $\ell_1$-SVM, all of which rely on the $\ell_1$-norm. This paper suggests replacing the $k$-rank $\ell_2$ approximation with an $\ell_p$ approximation, for $p\in [1,2]$. We then provide practical and provable approximation algorithms to compute it for any $p\geq1$, based on modern techniques in computational geometry. Extensive experimental results on the GLUE benchmark for compressing BERT, DistilBERT, XLNet, and RoBERTa confirm this theoretical advantage. For example, our approach achieves $28\%$ compression of RoBERTa's embedding layer with only a $0.63\%$ additive drop in accuracy (without fine-tuning) on average over all tasks in GLUE, compared to an $11\%$ drop using the existing $\ell_2$-approximation. Open code is provided for reproducing and extending our results.
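The $\ell_2$ baseline described in this abstract can be sketched with a truncated SVD. This is a minimal illustration of the standard $k$-rank $\ell_2$ approximation only, not the $\ell_p$ algorithm the paper proposes; the function name `low_rank_factors`, the random matrix, and the sizes `n`, `d`, `k` are assumptions chosen for the example.

```python
import numpy as np

def low_rank_factors(A, k):
    """Return factors (left, right) with left @ right equal to the
    rank-k l2 approximation of A obtained from a truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    left = U[:, :k] * s[:k]   # n x k, singular values folded into the columns
    right = Vt[:k, :]         # k x d
    return left, right

# Hypothetical layer sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n, d, k = 64, 32, 8
A = rng.standard_normal((n, d))

left, right = low_rank_factors(A, k)
A_k = left @ right

# Storing the two factors takes (n + d) * k numbers instead of n * d.
stored = left.size + right.size

# The squared l2 error equals the sum of the discarded squared singular values.
err = np.sum((A - A_k) ** 2)
tail = np.sum(np.linalg.svd(A, compute_uv=False)[k:] ** 2)
```

Keeping the two factors separate, rather than multiplying them back into `A_k`, is what yields the $O((n+d)k)$ storage mentioned above.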
A coreset is usually a small weighted subset of $n$ input points in $\mathbb{R}^d$ that provably approximates their loss function for a given set of queries (models, classifiers, etc.). Coresets are becoming increasingly common in machine learning, since existing heuristics or inefficient algorithms may be improved by running them, possibly many times, on the small coreset, which can be maintained for streaming and distributed data. Coresets can be obtained by sensitivity (importance) sampling, where the coreset size is proportional to the total sum of sensitivities. Unfortunately, computing the sensitivity of each point is problem dependent and may be harder than solving the original optimization problem at hand. We suggest a generic framework for computing sensitivities (and thus coresets) for a wide family of loss functions, which we call near-convex functions. This is done by suggesting the $f$-SVD factorization, which generalizes the SVD factorization of matrices to functions. Example applications include coresets that are either new or significantly improve previous results, such as for SVM, logistic regression, M-estimators, and $\ell_z$-regression. Experimental results and open source code are also provided.
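Sensitivity sampling can be illustrated on the classical special case of the squared loss $\sum_i (x_i^\top q)^2$, where leverage scores computed from the SVD serve as sensitivity bounds. This is a sketch of that well-known special case, not the paper's $f$-SVD framework; the helper names, the Gaussian data, and the sample size are assumptions, and the input is assumed to have full column rank.

```python
import numpy as np

def leverage_scores(X):
    """Squared row norms of U from the thin SVD of X.
    Assumes X has full column rank, so the scores sum to X.shape[1]."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U ** 2, axis=1)

def sensitivity_sample(X, m, rng):
    """Sample m row indices with probability proportional to the scores,
    and reweight so the weighted loss is an unbiased estimate."""
    s = leverage_scores(X)
    p = s / s.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])
    return idx, w

# Hypothetical data and query, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
idx, w = sensitivity_sample(X, 400, rng)

q = rng.standard_normal(5)
full_loss = np.sum((X @ q) ** 2)
coreset_loss = np.sum(w * (X[idx] @ q) ** 2)
rel_error = abs(coreset_loss - full_loss) / full_loss
```

Here the total sensitivity is the sum of the leverage scores, which equals the number of columns, so the sample size needed does not grow with the number of input rows.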
PAC-learning usually aims to compute a small subset ($\varepsilon$-sample/net) from $n$ items that provably approximates a given loss function for every query (model, classifier, hypothesis) from a given set of queries, up to an additive error $\varepsilon\in(0,1)$. Coresets generalize this idea to support multiplicative error $1\pm\varepsilon$. Inspired by smoothed analysis, we suggest a natural generalization: approximate the \emph{average} (instead of the worst-case) error over the queries, in the hope of getting smaller subsets. The dependency between errors of different queries implies that we may no longer apply the Chernoff-Hoeffding inequality for a fixed query and then use the VC-dimension or the union bound. This paper provides deterministic and randomized algorithms for computing such coresets and $\varepsilon$-samples of size independent of $n$, for any finite set of queries and loss function. Example applications include new and improved coreset constructions for, e.g., streaming vector summarization [ICML17] and $k$-PCA [NIPS16]. Experimental results with open source code are provided.
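The difference between worst-case and average error over a finite query set can be made concrete with a toy example. The sketch below draws a plain uniform sample, which is not one of the paper's algorithms, and measures both error notions; the threshold queries, the 0/1 loss, and all sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random(10_000)           # n items in [0, 1)
queries = np.linspace(0.0, 1.0, 21)   # finite set of threshold queries

def losses(pts, thresholds):
    """Mean 0/1 loss per query: the fraction of points with value <= threshold."""
    return (pts[:, None] <= thresholds[None, :]).mean(axis=0)

# Uniform sample standing in for an epsilon-sample (assumption).
sample = rng.choice(points, size=500, replace=False)

# Additive error of the sample for each query.
errors = np.abs(losses(points, queries) - losses(sample, queries))

worst_case_error = errors.max()   # what a classical epsilon-sample bounds
average_error = errors.mean()     # the relaxed measure discussed above
```

Because the average over queries can never exceed the maximum, a subset that meets an average-error target may be smaller than one that must meet the same target for every query.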