242 - Tian Yu Liu , Jiashi Feng 2021
Brain tumors are a common and often fatal form of cancer that affects both adults and children. Classifying brain tumors into different types is hence a crucial task, as it greatly influences the treatment that physicians will prescribe. In light of this, medical imaging techniques, especially those applying deep convolutional networks followed by a classification layer, have been developed to enable computer-aided classification of brain tumor types. In this paper, we present a novel approach of directly learning deep embeddings for brain tumor types, which can be used for downstream tasks such as classification. Along with using triplet loss variants, our approach applies contrastive learning to perform unsupervised pre-training, combined with a rare-case data augmentation module to ameliorate the data-scarcity problem in the brain tumor imaging analysis domain. We evaluate our method on an extensive brain tumor dataset which consists of 27 different tumor classes, out of which 13 are defined as rare. With a common encoder during all the experiments, we compare our approach with a baseline classification-layer based model, and the results demonstrate the effectiveness of our approach across all measured metrics.
Recent studies on mobile network design have demonstrated the remarkable effectiveness of channel attention (e.g., the Squeeze-and-Excitation attention) for lifting model performance, but they generally neglect the positional information, which is important for generating spatially selective attention maps. In this paper, we propose a novel attention mechanism for mobile networks by embedding positional information into channel attention, which we call coordinate attention. Unlike channel attention that transforms a feature tensor to a single feature vector via 2D global pooling, the coordinate attention factorizes channel attention into two 1D feature encoding processes that aggregate features along the two spatial directions, respectively. In this way, long-range dependencies can be captured along one spatial direction and meanwhile precise positional information can be preserved along the other spatial direction. The resulting feature maps are then encoded separately into a pair of direction-aware and position-sensitive attention maps that can be complementarily applied to the input feature map to augment the representations of the objects of interest. Our coordinate attention is simple and can be flexibly plugged into classic mobile networks, such as MobileNetV2, MobileNeXt, and EfficientNet with nearly no computational overhead. Extensive experiments demonstrate that our coordinate attention is not only beneficial to ImageNet classification but, more interestingly, behaves better in downstream tasks, such as object detection and semantic segmentation. Code is available at https://github.com/Andrew-Qibin/CoordAttention.
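The factorization described in this abstract can be illustrated with a minimal sketch. It handles a single channel stored as a nested list and pools it along each spatial direction separately. The function name `coordinate_attention_2d` is hypothetical, and the learned convolutions that the actual method would use to encode the pooled features are replaced by a plain sigmoid, so this is a simplification for illustration and not the authors' implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention_2d(fmap):
    """Apply direction-aware gating to one channel.

    fmap is a list of H rows, each a list of W floats.
    Assumption: pooled descriptors go straight through a sigmoid;
    the learned encoding layers are omitted.
    """
    h, w = len(fmap), len(fmap[0])
    # Pool along the width: one descriptor per row keeps vertical position.
    row_desc = [sum(row) / w for row in fmap]
    # Pool along the height: one descriptor per column keeps horizontal position.
    col_desc = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]
    a_h = [sigmoid(v) for v in row_desc]
    a_w = [sigmoid(v) for v in col_desc]
    # Both 1D attention maps are applied multiplicatively to the input.
    return [[fmap[i][j] * a_h[i] * a_w[j] for j in range(w)] for i in range(h)]


fmap = [[0.0, 0.0, 0.0],
        [0.0, 4.0, 0.0]]
out = coordinate_attention_2d(fmap)
```

Because each attention value depends on one row or one column, the output keeps the input's shape while the gating remains position-sensitive in both directions.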
GPT-3 can perform numerous tasks when provided a natural language prompt that contains a few training examples. We show that this type of few-shot learning can be unstable: the choice of prompt format, training examples, and even the order of the training examples can cause accuracy to vary from near chance to near state-of-the-art. We demonstrate that this instability arises from the bias of language models towards predicting certain answers, e.g., those that are placed near the end of the prompt or are common in the pre-training data. To mitigate this, we first estimate the model's bias towards each answer by asking for its prediction when given the training prompt and a content-free test input such as "N/A". We then fit calibration parameters that cause the prediction for this input to be uniform across answers. On a diverse set of tasks, this contextual calibration procedure substantially improves GPT-3 and GPT-2's average accuracy (up to 30.0% absolute) and reduces variance across different choices of the prompt.
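The calibration step in this abstract can be sketched in a few lines. The abstract only says that the parameters make the content-free prediction uniform; the diagonal rescaling below is one simple choice that satisfies this and is assumed here. The function name and the probability values are hypothetical.

```python
def contextual_calibrate(p_content_free, p_test):
    """Rescale label probabilities so the content-free input maps to uniform.

    Assumption: calibration is a per-label division by the probability the
    model gave that label for the content-free input, then renormalization.
    """
    scaled = [p / b for p, b in zip(p_test, p_content_free)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical label probabilities from a biased model.
p_cf = [0.7, 0.3]    # prediction for the content-free input "N/A"
p_test = [0.6, 0.4]  # prediction for a real test input

calibrated = contextual_calibrate(p_cf, p_test)
uniform_check = contextual_calibrate(p_cf, p_cf)
```

In this toy case the uncalibrated model prefers the first label, while after calibration the second label is more probable, since the model's preference for the first label was weaker than its bias on the content-free input.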
We investigate response selection for multi-turn conversation in retrieval-based chatbots. Existing studies pay more attention to the matching between utterances and responses by calculating the matching score based on learned features, leading to insufficient model reasoning ability. In this paper, we propose a graph-reasoning network (GRN) to address the problem. GRN first conducts pre-training based on ALBERT using next utterance prediction and utterance order prediction tasks specifically devised for response selection. These two customized pre-training tasks endow our model with the ability to capture semantic and chronological dependencies between utterances. We then fine-tune the model on an integrated network with sequence reasoning and graph reasoning structures. The sequence reasoning module conducts inference based on the highly summarized context vector of utterance-response pairs from the global perspective. The graph reasoning module conducts reasoning on the utterance-level graph neural network from the local perspective. Experiments on two conversational reasoning datasets show that our model can dramatically outperform the strong baseline methods and can achieve performance which is close to human-level.
We investigate spin chains with bilinear-biquadratic spin interactions as a function of an applied magnetic field $h$. At the Uimin-Lai-Sutherland (ULS) critical point we find a remarkable hierarchy of fractionalized excitations revealed by the dynamical structure factor $S(q,\omega)$ as a function of magnetic field yielding a transition from a gapless phase to another gapless phase before reaching the fully polarized state. At $h=0$, the envelope of the lowest energy excitations goes soft at two points $q_1=2\pi/3$ and $q_2=4\pi/3$, dubbed the A-phase. With increasing field, the spectral peaks at each of the gapless points bifurcate and combine to form a new set of fractionalized excitations that soften at a single point $q=\pi$ at $h_{c1}\approx 0.94$. Beyond $h_{c1}$ the system remains in this phase dubbed the B-phase until the transition at $h_{c2}=4$ to the fully polarized phase. We discuss the central charge of these two gapless phases and contrast the behavior with that of the gapped Haldane phase in a field.
Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model's training set, which causes the model to frequently predict Positive whenever the input contains "James Bond". Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling ("Apple iPhone" triggers negative generations) and machine translation ("iced coffee" mistranslated as "hot coffee"). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.
92 - Pan Zhou , Jiashi Feng , Chao Ma 2020
It is not yet clear why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to explain this generalization gap by analyzing their local convergence behaviors. Specifically, we observe the heavy tails of gradient noise in these algorithms. This motivates us to analyze these algorithms through their Lévy-driven stochastic differential equations (SDEs) because of the similar convergence behaviors of an algorithm and its SDE. Then we establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends on the Radon measure of the basin positively and the heaviness of gradient noise negatively; (2) for the same basin, SGD enjoys smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM via adaptively scaling each gradient coordinate well diminishes the anisotropic structure in gradient noise and results in larger Radon measure of a basin; (b) the exponential gradient average in ADAM smooths its gradient and leads to lighter gradient noise tails than SGD. So SGD is more locally unstable than ADAM at sharp minima, defined as the minima whose local basins have small Radon measure, and can better escape from them to flatter ones with larger Radon measure. Since flat minima, which here refer to the minima at flat or asymmetric basins/valleys, often generalize better than sharp ones \cite{keskar2016large,he2019asymmetric}, our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient noise assumption and theoretical affirmation.
Visual relationship detection aims to reason over relationships among salient objects in images, which has drawn increasing attention over the past few years. Inspired by human reasoning mechanisms, it is believed that external visual commonsense knowledge is beneficial for reasoning visual relationships of objects in images, which is however rarely considered in existing methods. In this paper, we propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT), which performs relational reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training with multimodal representations. RVL-BERT also uses an effective spatial module and a novel mask attention module to explicitly capture spatial information among the objects. Moreover, our model decouples object detection from visual relationship recognition by taking in object names directly, enabling it to be used on top of any object detection system. We show through quantitative and qualitative experiments that, with the transferred knowledge and novel modules, RVL-BERT achieves competitive results on two challenging visual relationship detection datasets. The source code is available at https://github.com/coldmanck/RVL-BERT.
Although CNNs have reached satisfactory performance in image-related tasks, using CNNs to process videos is much more challenging due to the enormous size of raw video streams. In this work, we propose to use motion vectors and residuals from modern video compression techniques to effectively learn the representation of the raw frames and greatly remove the temporal redundancy, giving a faster video processing model. Compressed Video Action Recognition (CoViAR) explored directly using compressed video to train the deep neural network, where the motion vectors were utilized to represent temporal information. However, motion vectors are designed for minimizing video size, where precise motion information is not obligatory. Compared with optical flow, motion vectors contain noisy and unreliable motion information. Inspired by the mechanism of video compression codecs, we propose an approach to refine the motion vectors in which unreliable movement is removed while temporal information is largely preserved. We show that replacing the original motion vectors with the refined ones while using the same network as CoViAR achieves state-of-the-art performance on UCF-101 and HMDB-51, with a negligible loss of efficiency compared with the original CoViAR.
Pursuing more complete and coherent scene understanding towards realistic vision applications drives edge detection from category-agnostic to category-aware semantic level. However, finer delineation of instance-level boundaries still remains unexplored. In this work, we address a new finer-grained task, termed panoptic edge detection (PED), which aims at predicting semantic-level boundaries for stuff categories and instance-level boundaries for instance categories, in order to provide more comprehensive and unified scene understanding from the perspective of edges. We then propose a versatile framework, Panoptic Edge Network (PEN), which aggregates different tasks of object detection, semantic and instance edge detection into a single holistic network with multiple branches. Based on the same feature representation, the semantic edge branch produces semantic-level boundaries for all categories and the object detection branch generates instance proposals. Conditioned on the prior information from these two branches, the instance edge branch aims at instantiating edge predictions for instance categories. We also devise a Panoptic Dual F-measure (F2) metric for the new PED task to uniformly measure edge prediction quality for both stuff and instances. By joint end-to-end training, the proposed PEN framework outperforms all competitive baselines on Cityscapes and ADE20K datasets.