Transformer networks have revolutionized NLP representation learning since they were introduced. Although great effort has been made to explain the representations learned by transformers, it is widely recognized that our understanding is not sufficient. One important reason is the lack of visualization tools for detailed analysis. In this paper, we propose to use dictionary learning to open up these "black boxes" as linear superpositions of transformer factors. Through visualization, we demonstrate the hierarchical semantic structures captured by the transformer factors, e.g., word-level polysemy disambiguation, sentence-level pattern formation, and long-range dependency. While some of these patterns confirm conventional prior linguistic knowledge, the rest are relatively unexpected and may provide new insights. We hope this visualization tool can bring further knowledge and a better understanding of how transformer networks work.
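To make the idea concrete, the following is a minimal sketch (not the authors' exact pipeline) of how hidden states from a pretrained transformer can be factored with sparse dictionary learning, so that each hidden vector is approximated as a sparse linear superposition of learned "transformer factors". The model name, layer index, dictionary size, and sparsity penalty below are illustrative assumptions, not values taken from the paper.

```python
# Sketch: decompose transformer hidden states as sparse combinations of
# dictionary atoms ("transformer factors"), x ≈ sum_i alpha_i * phi_i.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import MiniBatchDictionaryLearning

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = [
    "The bank raised interest rates.",
    "They sat on the bank of the river.",
]

# Collect token-level hidden states from one intermediate layer.
LAYER = 6  # illustrative layer index
hidden_vectors = []
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        layer_states = outputs.hidden_states[LAYER][0]  # (seq_len, hidden_dim)
        hidden_vectors.append(layer_states.numpy())
X = np.concatenate(hidden_vectors, axis=0)  # (n_tokens, hidden_dim)

# Learn a dictionary; each atom is one candidate "transformer factor".
# In practice one would use a large corpus and an overcomplete dictionary;
# the tiny sizes here just keep the toy example fast.
dict_learner = MiniBatchDictionaryLearning(
    n_components=64,               # illustrative dictionary size
    alpha=1.0,                     # sparsity penalty
    transform_algorithm="lasso_lars",
    random_state=0,
)
codes = dict_learner.fit_transform(X)   # sparse coefficients alpha per token
factors = dict_learner.components_      # dictionary Phi, one factor per row

# Tokens that activate a given factor most strongly can then be inspected
# to interpret the linguistic pattern that factor captures.
factor_id = 0
top_tokens = np.argsort(-np.abs(codes[:, factor_id]))[:5]
print("Strongest activations for factor", factor_id, ":", top_tokens)
```

Visualizing which tokens and contexts most strongly activate each factor is what surfaces the word-level, sentence-level, and long-range patterns described above.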