We study the current best model (KDG) for question answering on tabular data evaluated over the WikiTableQuestions dataset. Previous ablation studies on this model attributed its performance to certain aspects of its architecture. In this paper, we find that the model's performance also crucially depends on a particular pruning of the data used to train the model. Disabling the pruning step drops the accuracy of the model from 43.3% to 36.3%. The large impact on the performance of the KDG model suggests that the pruning may be a useful pre-processing step in training other semantic parsers as well.
Abbreviation disambiguation is important for automated clinical note processing due to the frequent use of abbreviations in clinical settings. Current models for automated abbreviation disambiguation are restricted by the scarcity and imbalance of labeled training data, decreasing their generalizability to orthogonal sources. In this work we propose a novel data augmentation technique that utilizes information from related medical concepts, which improves our model's ability to generalize. Furthermore, we show that incorporating global context information from the whole medical note (in addition to the traditional local context window) can significantly improve the model's representations of abbreviations. We train our model on a public dataset (MIMIC III) and test its performance on datasets from different sources (CASI, i2b2). Together, these two techniques boost the accuracy of abbreviation disambiguation by almost 14% on the CASI dataset and 4% on i2b2.
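For concreteness, below is a minimal Python sketch of the two ideas at a high level: augmenting training samples with terms for related medical concepts, and pairing the local context window with the whole note as global context. The RELATED_CONCEPTS map, function names, and window size are illustrative assumptions, not the paper's exact method.

```python
import random

# Hypothetical neighbour map: each candidate expansion is linked to related
# medical concepts (in practice these might come from an ontology such as UMLS).
RELATED_CONCEPTS = {
    "atrial fibrillation": ["atrial flutter", "cardiac arrhythmia"],
    "acute renal failure": ["acute kidney injury", "renal insufficiency"],
}

def augment_with_related_concepts(tokens, abbrev_idx, expansion, n_aug=2):
    """Generate extra labelled samples by substituting terms for related
    concepts at the abbreviation's position; the label (expansion) is kept."""
    samples = [(tokens, expansion)]
    neighbours = RELATED_CONCEPTS.get(expansion, [])
    for concept in random.sample(neighbours, k=min(n_aug, len(neighbours))):
        new_tokens = tokens[:abbrev_idx] + concept.split() + tokens[abbrev_idx + 1:]
        samples.append((new_tokens, expansion))
    return samples

def local_and_global_context(tokens, abbrev_idx, window=5):
    """Return both the local window around the abbreviation and the whole
    note, so a downstream encoder can combine the two representations."""
    local = tokens[max(0, abbrev_idx - window): abbrev_idx + window + 1]
    return {"local": local, "global": tokens}

# Example: "AF" at position 3 in a toy note, labelled "atrial fibrillation".
note = "patient with new AF started on anticoagulation".split()
augmented = augment_with_related_concepts(note, 3, "atrial fibrillation")
context = local_and_global_context(note, 3)
```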
Sparsification is an efficient approach to accelerate CNN inference, but it is challenging to exploit sparsity during training because the gradients involved change dynamically. A key observation is that most activation gradients in back-propagation are very close to zero and have only a tiny impact on the weight updates. Hence, we randomly prune these very small gradients, guided by the statistical distribution of the activation gradients, to accelerate CNN training. We also theoretically analyze the impact of the pruning algorithm on convergence. The proposed approach is evaluated on AlexNet and ResNet-{18, 34, 50, 101, 152} with CIFAR-{10, 100} and ImageNet datasets. Experimental results show that our training approach achieves up to $5.92\times$ speedup in the back-propagation stage with negligible accuracy loss.
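To illustrate the idea, here is a minimal PyTorch sketch of randomly pruning small activation gradients during back-propagation; the percentile threshold, the unbiased stochastic zeroing rule, and the hook placement are assumptions rather than the authors' exact algorithm.

```python
import torch

def stochastic_gradient_pruning(grad, sparsity=0.9):
    """Randomly prune small entries of an activation gradient.

    Entries with |g| below the `sparsity`-quantile threshold tau are kept
    as +/- tau with probability |g| / tau and zeroed otherwise, so the
    pruned gradient is unbiased in expectation (an assumed design choice).
    """
    tau = torch.quantile(grad.abs().flatten(), sparsity)
    if tau == 0:
        return grad
    small = grad.abs() < tau
    keep = torch.rand_like(grad) < grad.abs() / tau
    pruned = torch.where(small & keep, tau * grad.sign(), grad)
    pruned = torch.where(small & ~keep, torch.zeros_like(grad), pruned)
    return pruned

# Example: attach the pruning rule to an activation with a backward hook.
x = torch.randn(32, 64, requires_grad=True)
h = torch.relu(x)
h.register_hook(lambda g: stochastic_gradient_pruning(g, sparsity=0.9))
loss = (h ** 2).sum()
loss.backward()
```

The speedup in practice would come from exploiting the resulting sparsity in the back-propagation kernels, which this sketch does not attempt.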
Work on the lottery ticket hypothesis (LTH) and single-shot network pruning (SNIP) has recently drawn much attention to post-training pruning (iterative magnitude pruning) and before-training pruning (pruning at initialization). The former suffers from an extremely large computational cost, while the latter usually struggles with insufficient performance. In comparison, during-training pruning, a class of methods that enjoys both training/inference efficiency and comparable performance, has so far been less explored. To better understand during-training pruning, we quantitatively study the effect of pruning throughout training from the perspective of pruning plasticity (the ability of the pruned networks to recover the original performance). Pruning plasticity can help explain several other empirical observations about neural network pruning in the literature. We further find that pruning plasticity can be substantially improved by injecting a brain-inspired mechanism called neuroregeneration, i.e., regenerating the same number of connections as pruned. Based on the insights from pruning plasticity, we design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), and its dynamic sparse training (DST) variant (GraNet-ST). Both advance the state of the art. Perhaps most impressively, the latter for the first time boosts sparse-to-sparse training performance over various dense-to-sparse methods by a large margin with ResNet-50 on ImageNet. We will release all code.
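The following is a minimal PyTorch sketch of one prune-and-regenerate step in the spirit of zero-cost neuroregeneration: magnitude pruning removes a fraction of the active connections, and the same number of inactive connections is regrown by gradient magnitude, leaving the layer's sparsity unchanged. The criteria, fraction, and function names are assumptions, not the exact GraNet schedule.

```python
import torch

def prune_and_regenerate(weight, grad, mask, prune_frac=0.05):
    """One prune-then-regrow step on a single layer.

    Drops the `prune_frac` fraction of currently active weights with the
    smallest magnitude, then regenerates the same number of connections at
    inactive positions with the largest gradient magnitude, so sparsity is
    unchanged by the regeneration.
    """
    active = mask.bool()
    n = int(prune_frac * active.sum().item())
    if n == 0:
        return mask

    # Prune: smallest-magnitude active weights.
    scores = weight.abs().masked_fill(~active, float('inf')).flatten()
    drop_idx = torch.topk(scores, n, largest=False).indices
    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0

    # Regenerate: inactive positions with the largest gradient magnitude.
    regrow = grad.abs().flatten().masked_fill(new_mask.bool(), float('-inf'))
    grow_idx = torch.topk(regrow, n, largest=True).indices
    new_mask[grow_idx] = 1
    return new_mask.view_as(mask)

# Example usage on a toy layer at roughly 50% sparsity.
weight = torch.randn(128, 256)
grad = torch.randn(128, 256)                 # gradient of the loss w.r.t. weight
mask = (torch.rand(128, 256) > 0.5).float()
mask = prune_and_regenerate(weight, grad, mask, prune_frac=0.05)
```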
Adapting pre-trained neural models to downstream tasks has become the standard practice for obtaining high-quality models. In this work, we propose a novel model adaptation paradigm, adapting by pruning, which prunes neural connections in the pre-trained model to optimise the performance on the target task; all remaining connections have their weights kept intact. We formulate adapting-by-pruning as an optimisation problem with a differentiable loss and propose an efficient algorithm to prune the model. We prove that the algorithm is near-optimal under standard assumptions and apply the algorithm to adapt BERT to several GLUE tasks. Results suggest that our method can prune up to 50% of the weights in BERT while yielding similar performance compared to the fine-tuned full model. We also compare our method with other state-of-the-art pruning methods and study the topological differences of their obtained sub-networks.
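As an illustration, the sketch below learns a pruning mask over frozen pre-trained weights using a sigmoid relaxation with a straight-through estimator; the specific relaxation, threshold, and initialization are assumptions and need not match the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer whose pre-trained weights stay frozen; only a real-valued
    score per weight is trained.  The forward pass applies a hard 0/1 mask
    derived from the scores, with a straight-through gradient estimator."""

    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        self.weight = nn.Parameter(pretrained.weight.detach().clone(),
                                   requires_grad=False)      # frozen
        self.bias = nn.Parameter(pretrained.bias.detach().clone(),
                                 requires_grad=False)        # frozen
        # Scores start positive so every connection is initially kept.
        self.scores = nn.Parameter(torch.full_like(self.weight, 1.0))

    def forward(self, x):
        soft = torch.sigmoid(self.scores)        # differentiable relaxation
        hard = (soft > 0.5).float()              # binary mask used in the forward pass
        mask = hard + soft - soft.detach()       # straight-through estimator
        return F.linear(x, self.weight * mask, self.bias)

# Example: wrap one layer and train only the mask scores.
layer = MaskedLinear(nn.Linear(768, 768))
optimizer = torch.optim.Adam([layer.scores], lr=1e-3)
out = layer(torch.randn(4, 768))
```

Driving the mask toward a target sparsity (e.g. keeping only 50% of the weights) would additionally require a sparsity penalty or a top-k constraint on the scores, which is omitted here.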
We present an optical photometric and spectroscopic study of the very luminous type IIn SN 2006gy over a time period spanning more than one year. In photometry, a broad, bright (M_R~-21.7) peak characterizes all BVRI light curves. Afterwards, a rapid luminosity decline is followed by a phase of slow decline between days ~170 and ~237. At late phases (>237 days), because of the large luminosity drop (>3 mag), only upper limits are obtained in the B, R and I bands. In the near-infrared, two K-band detections on days 411 and 510 raise new questions about dust formation or IR echo scenarios. At all epochs the spectra are characterized by the absence of broad P-Cygni profiles and by a multicomponent Halpha profile, which are the typical signatures of type IIn SNe. After maximum, spectroscopic and photometric similarities are found between SN 2006gy and bright, interaction-dominated SNe (e.g. SN 1997cy, SN 1999E and SN 2002ic). This suggests that ejecta-CSM interaction plays a key role in SN 2006gy about 6 to 8 months after maximum, sustaining the late-time light curve. Alternatively, the late luminosity may be related to the radioactive decay of ~3 M_sun of 56Ni. Models of the light curve in the first 170 days suggest that the progenitor was a compact star (R~(6-8)x10^12 cm, M_ej~5-14 M_sun), and that the SN ejecta collided with massive (6-10 M_sun), opaque clumps of previously ejected material. These clumps do not completely obscure the SN photosphere, so that at its peak the luminosity is due both to the decay of 56Ni and to interaction with the CSM. A supermassive star is not required to explain the observational data, nor is an extraordinarily large explosion energy.