
Go Wider Instead of Deeper

Posted by: Fuzhao Xue
Publication date: 2021
Research field: Informatics Engineering
Paper language: English





More transformer blocks with residual connections have recently achieved impressive results on various tasks. To achieve better performance with fewer trainable parameters, recent methods propose to go shallower by sharing parameters or compressing the model along its depth. However, weak modeling capacity limits their performance. In contrast, going wider by introducing more trainable matrices and parameters would produce a huge model that requires advanced parallelism for training and inference. In this paper, we propose a parameter-efficient framework that goes wider instead of deeper. Specifically, following existing work, we adopt parameter sharing to compress along the depth. Such a deployment alone, however, limits performance. To maximize modeling capacity, we scale along the model width by replacing the feed-forward network (FFN) with a mixture-of-experts (MoE) layer. Across transformer blocks, instead of sharing normalization layers, we propose to use individual layer norms to transform the varying semantic representations in a more parameter-efficient way. To evaluate our plug-and-run framework, we design WideNet and conduct comprehensive experiments on popular computer vision and natural language processing benchmarks. On ImageNet-1K, our best model outperforms the Vision Transformer (ViT) by $1.5\%$ with $0.72\times$ the trainable parameters. Using $0.46\times$ and $0.13\times$ the parameters, WideNet still surpasses ViT and ViT-MoE by $0.8\%$ and $2.1\%$, respectively. On four natural language processing datasets, WideNet outperforms ALBERT by $1.8\%$ on average and surpasses BERT with factorized embedding parameterization by $0.8\%$ while using fewer parameters.
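As a rough illustration of the architecture the abstract describes, here is a minimal PyTorch-style sketch of the idea, assuming a pre-norm transformer encoder: one block's attention and MoE weights are reused at every depth step, while each step keeps its own LayerNorms. The class names, dimensions, and the simple top-1 routing below are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of a "wider instead of deeper" encoder: shared attention/MoE
# weights recycled across depth, with individual LayerNorms per depth step.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFFN(nn.Module):
    """Feed-forward layer replaced by a small top-1 mixture-of-experts."""

    def __init__(self, d_model, d_hidden, num_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (batch, seq, d_model)
        scores = F.softmax(self.gate(x), dim=-1)           # routing probabilities
        top_p, top_idx = scores.max(dim=-1)                # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                            # tokens routed to expert i
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out


class WideNetStyleEncoder(nn.Module):
    """Shares one block's attention and MoE weights across `depth` steps,
    but gives every step its own pair of LayerNorms."""

    def __init__(self, d_model=256, n_heads=4, d_hidden=1024, num_experts=4, depth=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)   # shared
        self.moe = MoEFFN(d_model, d_hidden, num_experts)                       # shared
        self.norms1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])  # individual
        self.norms2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])  # individual

    def forward(self, x):
        for ln1, ln2 in zip(self.norms1, self.norms2):
            h = ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.moe(ln2(x))
        return x


if __name__ == "__main__":
    model = WideNetStyleEncoder()
    tokens = torch.randn(2, 16, 256)        # (batch, seq, d_model)
    print(model(tokens).shape)              # torch.Size([2, 16, 256])
```

In this sketch the count of unique attention/MoE parameters stays constant as `depth` grows; only the lightweight per-depth LayerNorms are added, which is the parameter-efficiency argument the abstract makes.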




Read also

We present the Deeper Wider Faster (DWF) program, which coordinates more than 30 multi-wavelength and multi-messenger facilities worldwide and in space to detect and study fast transients (millisecond-to-hours duration). DWF has four main components: (1) simultaneous observations, where about 10 major facilities, from radio to gamma-ray, are coordinated to perform deep, wide-field, fast-cadenced observations of the same field at the same time. Radio telescopes search for fast radio bursts while optical imagers and high-energy instruments search for seconds-to-hours timescale transient events; (2) real-time (seconds to minutes) supercomputer data processing and candidate identification, along with real-time (minutes) human inspection of candidates using sophisticated visualisation technology; (3) rapid-response (minutes) follow-up spectroscopy and imaging and conventional ToO observations; and (4) long-term follow-up with a global network of 1-4 m-class telescopes. The principal goals of DWF are to discover and study counterparts to fast radio bursts and gravitational wave events, along with millisecond-to-hour duration transients at all wavelengths.
Identification of anomalous light curves within time-domain surveys is often challenging. In addition, with the growing number of wide-field surveys and the volume of data produced exceeding astronomers' ability for manual evaluation, outlier and anomaly detection is becoming vital for transient science. We present an unsupervised method for transient discovery using a clustering technique and the Astronomaly package. As proof of concept, we evaluate 85,553 minute-cadenced light curves collected over two 1.5-hour periods as part of the Deeper, Wider, Faster program, using two different telescope dithering strategies. By combining the clustering technique HDBSCAN with the isolation forest anomaly detection algorithm via the visual interface of Astronomaly, we are able to rapidly isolate anomalous sources for further analysis. We successfully recover the known variable sources across a range of catalogues from within the fields, and find a further 7 uncatalogued variables and two stellar flare events, including a rarely observed ultra-fast (5-minute) flare from a likely M dwarf.
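The clustering-plus-anomaly-scoring combination this abstract describes can be sketched in a few lines, assuming per-source summary features have already been extracted from the light curves. This is not the Astronomaly pipeline itself; the toy data, feature choices, and hyperparameters below are placeholders.

```python
# Sketch: cluster light-curve feature vectors with HDBSCAN, then rank sources by
# isolation-forest anomaly score so the most unusual ones surface first.
import numpy as np
import hdbscan                                   # pip install hdbscan
from scipy.stats import skew
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
light_curves = rng.normal(size=(500, 90))        # 500 sources x 90 flux samples (toy data)
light_curves[:5] += np.linspace(0, 8, 90)        # inject a few flare-like outliers

# Simple per-source summary features; real pipelines use richer statistics.
features = np.column_stack([
    np.ptp(light_curves, axis=1),                # amplitude
    light_curves.std(axis=1),                    # variability
    skew(light_curves, axis=1),                  # asymmetry
])

clusters = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(features)
scores = IsolationForest(random_state=0).fit(features).score_samples(features)

# Lowest scores are the most anomalous; HDBSCAN's -1 label marks unclustered sources.
for idx in np.argsort(scores)[:10]:
    print(f"source {idx:3d}  cluster {clusters[idx]:3d}  anomaly score {scores[idx]:.3f}")
```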
Next-generation observations will revolutionize our understanding of binary black holes and will detect new sources, such as intermediate-mass black holes. Primary science goals include: Discover binary black holes throughout the observable Universe; Reveal the fundamental properties of black holes; Uncover the seeds of supermassive black holes.
This paper takes one step forward towards characterizing a new family of model-free Deep Reinforcement Learning (DRL) algorithms. The aim of these algorithms is to jointly learn an approximation of the state-value function ($V$) alongside an approximation of the state-action value function ($Q$). Our analysis starts with a thorough study of the Deep Quality-Value Learning (DQV) algorithm, a DRL algorithm which has been shown to outperform popular techniques such as Deep Q-Learning (DQN) and Double Deep Q-Learning (DDQN) (Sabatelli et al., 2018). Intending to investigate why DQV's learning dynamics allow this algorithm to perform so well, we formulate a set of research questions that help us characterize a new family of DRL algorithms. Among our results, we present some specific cases in which DQV's performance can get harmed and introduce a novel off-policy DRL algorithm, called DQV-Max, which can outperform DQV. We then study the behavior of the $V$ and $Q$ functions learned by DQV and DQV-Max and show that both algorithms might perform so well on several DRL test-beds because they are less prone to suffer from the overestimation bias of the $Q$ function.
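For orientation, here is a minimal sketch of the joint $V$/$Q$ targets this abstract refers to, assuming the update rules usually stated for DQV and DQV-Max (both regress $Q$ toward an $r + \gamma V(s')$ target, while DQV-Max bootstraps $V$ from $\max_a Q$ instead). The function and network names are placeholders, and the exact targets should be checked against the cited papers.

```python
# Sketch of DQV / DQV-Max bootstrap targets (assumed forms, see lead-in above).
import torch

def dqv_targets(reward, done, next_state, v_target_net, gamma=0.99):
    """DQV: both V(s) and Q(s, a) regress toward r + gamma * V(s')."""
    with torch.no_grad():
        bootstrap = gamma * (1.0 - done) * v_target_net(next_state).squeeze(-1)
    target = reward + bootstrap
    return target, target                      # (v_target, q_target)

def dqv_max_targets(reward, done, next_state, v_target_net, q_target_net, gamma=0.99):
    """DQV-Max: V regresses toward r + gamma * max_a Q(s', a); Q still uses V(s')."""
    with torch.no_grad():
        v_boot = gamma * (1.0 - done) * v_target_net(next_state).squeeze(-1)
        q_boot = gamma * (1.0 - done) * q_target_net(next_state).max(dim=-1).values
    return reward + q_boot, reward + v_boot    # (v_target, q_target)
```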
We investigate the discounting mismatch in actor-critic algorithm implementations from a representation learning perspective. Theoretically, actor-critic algorithms usually have discounting for both the actor and the critic, i.e., there is a $\gamma^t$ term in the actor update for the transition observed at time $t$ in a trajectory, and the critic is a discounted value function. Practitioners, however, usually ignore the discounting ($\gamma^t$) for the actor while using a discounted critic. We investigate this mismatch in two scenarios. In the first scenario, we consider optimizing an undiscounted objective ($\gamma = 1$) where $\gamma^t$ disappears naturally ($1^t = 1$). We then propose to interpret the discounting in the critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. In the second scenario, we consider optimizing a discounted objective ($\gamma < 1$) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective, again with supporting empirical results.
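Written out in standard policy-gradient notation (a restatement for clarity, not an equation taken from the paper), the mismatch looks as follows:

```latex
% Discounted-objective policy gradient (theory): a \gamma^t weight appears in the actor term.
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t \ge 0} \gamma^{t}\,
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}_{\gamma}(s_t, a_t)\right]

% Common implementations (practice): the \gamma^t weight is dropped in the actor update,
% while the critic estimate \hat{A}_{\gamma} remains discounted.
\hat{g}
  = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
      \hat{A}_{\gamma}(s_t, a_t)
```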
