117 - Tao Wang, Li Yuan, Yunpeng Chen 2021
Recently, DETR pioneered the solution of vision tasks with transformers: it directly translates the image feature map into the object detection result. Though effective, translating the full feature map can be costly due to redundant computation on areas such as the background. In this work, we encapsulate the idea of reducing spatial redundancy into a novel poll and pool (PnP) sampling module, with which we build an end-to-end PnP-DETR architecture that adaptively allocates its computation spatially to be more efficient. Concretely, the PnP module abstracts the image feature map into fine foreground object feature vectors and a small number of coarse background contextual feature vectors. The transformer models information interaction within the fine-coarse feature space and translates the features into the detection result. Moreover, the PnP-augmented model can instantly achieve various desired trade-offs between performance and computation with a single model by varying the sampled feature length, without requiring training of multiple models as existing methods do. It thus offers greater flexibility for deployment in diverse scenarios with varying computation constraints. We further validate the generalizability of the PnP module on panoptic segmentation and on the recent transformer-based image recognition model ViT, and show consistent efficiency gains. We believe our method takes a step toward efficient visual analysis with transformers, wherein spatial redundancy is commonly observed. Code will be available at https://github.com/twangnh/pnp-detr.
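To make the sampling idea above concrete, here is a minimal PyTorch sketch of a poll-and-pool style module. The scoring head, the soft pooling, and all names are our own simplifications inferred from the abstract, not the released PnP-DETR code.

```python
import torch
import torch.nn as nn

class PollAndPool(nn.Module):
    """Toy poll-and-pool sampler (a sketch; details are assumptions, not the paper's code).

    - 'poll': score every spatial location and keep the top-k features as fine foreground vectors.
    - 'pool': softly aggregate all features into a few coarse background context vectors.
    """
    def __init__(self, dim, num_fine, num_coarse):
        super().__init__()
        self.score = nn.Linear(dim, 1)                   # per-location informativeness score
        self.pool_weights = nn.Linear(dim, num_coarse)   # soft assignment to coarse slots
        self.num_fine = num_fine

    def forward(self, feats):                            # feats: (B, HW, C) flattened feature map
        s = self.score(feats).squeeze(-1)                # (B, HW)
        topk = s.topk(self.num_fine, dim=1).indices
        fine = torch.gather(feats, 1, topk.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
        # pool (for simplicity, over all locations) into a small number of coarse vectors
        a = self.pool_weights(feats).softmax(dim=1)      # (B, HW, M)
        coarse = torch.einsum('bnm,bnc->bmc', a, feats)  # (B, M, C)
        return torch.cat([fine, coarse], dim=1)          # shortened token sequence for the transformer
```

Varying `num_fine` at inference time is one way to realize the single-model performance-computation trade-off the abstract describes.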
92 - Peng Chen 2021
A recent variation of the Transformer, Performer, scales the Transformer to longer sequences with a linear attention mechanism. However, it is not compatible with relative position encoding, which has advantages over absolute position encoding. In this paper, we discuss possible ways to add relative position encoding to Performer. Based on the analysis, we propose PermuteFormer, a Performer-based model with relative position encoding that scales linearly on long sequences. PermuteFormer applies a position-dependent transformation to queries and keys to encode positional information into the attention module. This transformation is carefully crafted so that the final output of self-attention is not affected by the absolute positions of tokens. PermuteFormer introduces negligible computational overhead by design, so it runs as fast as Performer. We evaluate PermuteFormer on Long-Range Arena, a benchmark for long sequences, as well as on WikiText-103, a language modeling dataset. The experiments show that PermuteFormer uniformly improves the performance of Performer with almost no computational overhead and outperforms the vanilla Transformer on most of the tasks.
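As a toy illustration of why a position-dependent permutation yields relative-position behavior, the NumPy snippet below (our reading of the abstract, applied to raw vectors rather than Performer feature maps, and not the authors' implementation) shows that the score between a permuted query and key depends only on the positional offset.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # feature dimension
perm = rng.permutation(d)               # a fixed permutation pi over feature dimensions

def apply_perm_power(x, p):
    """Apply the permutation pi composed p times to vector x (position-dependent transform)."""
    idx = np.arange(len(x))
    for _ in range(p):
        idx = perm[idx]
    return x[idx]

q = rng.normal(size=d)
k = rng.normal(size=d)

# The (un-normalized) score between a query at position i and a key at position j
# reduces to a function of (j - i) only, because the permutations compose.
s1 = apply_perm_power(q, 2) @ apply_perm_power(k, 5)    # i=2,  j=5  -> offset 3
s2 = apply_perm_power(q, 7) @ apply_perm_power(k, 10)   # i=7,  j=10 -> offset 3
print(np.isclose(s1, s2))   # True: the score depends only on relative position
```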
Let $L$ be a non-negative self-adjoint operator acting on the space $L^2(X)$, where $X$ is a metric measure space. Let $L=\int_0^{\infty} \lambda \, dE_{L}(\lambda)$ be the spectral resolution of $L$ and let $S_R(L)f=\int_0^R dE_{L}(\lambda) f$ denote the spherical partial sums in terms of the resolution of $L$. In this article we give a sufficient condition on $L$ such that
$$ \lim_{R\rightarrow \infty} S_R(L)f(x) = f(x) \quad {\rm a.e.} $$
for any $f$ such that ${\rm log}(2+L) f \in L^2(X)$. These results are applicable to large classes of operators, including Dirichlet operators on smooth bounded domains, the Hermite operator, and Schrödinger operators with inverse square potentials.
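For orientation (our illustration, not part of the article), in the model case $L=-\Delta$ on $\mathbb{R}^n$ the spherical partial sum is the classical Fourier ball multiplier and the $L^2$ log-condition becomes an explicit weighted bound on $\widehat{f}$:

```latex
% Model case L = -\Delta on R^n, Fourier convention \widehat{f}(\xi)=\int f(x)\,e^{-i x\cdot\xi}\,dx:
S_R(-\Delta)f(x) \;=\; (2\pi)^{-n}\int_{|\xi|^2 \le R} \widehat{f}(\xi)\, e^{\,i x\cdot\xi}\, d\xi ,
\qquad
\log(2+L)\,f \in L^2(\mathbb{R}^n)
\;\Longleftrightarrow\;
\int_{\mathbb{R}^n} \bigl(\log(2+|\xi|^2)\bigr)^2\, |\widehat{f}(\xi)|^2\, d\xi < \infty .
```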
We present a survey of $^{12}$CO/$^{13}$CO/C$^{18}$O (J=1-0) toward the California Molecular Cloud (CMC) within the region 161.75$^{\circ} \leqslant l \leqslant$ 167.75$^{\circ}$, $-$9.5$^{\circ} \leqslant b \leqslant -$7.5$^{\circ}$, using the Purple Mountain Observatory (PMO) 13.7 m millimeter telescope. Adopting a distance of 470 pc, the mass of the observed molecular cloud estimated from $^{12}$CO, $^{13}$CO, and C$^{18}$O is about 2.59$\times$10$^{4}$ M$_\odot$, 0.85$\times$10$^{4}$ M$_\odot$, and 0.09$\times$10$^{4}$ M$_\odot$, respectively. A large-scale continuous filament extending about 72 pc is revealed in the $^{13}$CO images. A systematic velocity gradient perpendicular to the major axis appears and is measured to be $\sim$0.82 km s$^{-1}$ pc$^{-1}$. The kinematics along the filament shows an oscillation pattern with a fragmentation wavelength of $\sim$2.3 pc and a velocity amplitude of $\sim$0.92 km s$^{-1}$, which may be related to core-forming flows. Furthermore, assuming an inclination angle of 45$^{\circ}$ to the plane of the sky, the estimated average accretion rate is $\sim$101 M$_\odot$ Myr$^{-1}$ for the cluster LkH$\alpha$ 101 and $\sim$21 M$_\odot$ Myr$^{-1}$ for the other regions. In the C$^{18}$O observations, the large-scale filament is resolved into multiple substructures whose dynamics are consistent with the scenario of filament formation from converging flows. Approximately 225 C$^{18}$O cores are extracted, of which 181 are starless cores. Roughly 37\% (67/181) of the starless cores have $\alpha_{\text{vir}}$ less than 1. Twenty outflow candidates are identified along the filament. Our results indicate active early-phase star formation along the large-scale filament in the CMC region.
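For readers unfamiliar with the boundedness criterion quoted above, the short script below evaluates the standard virial parameter $\alpha_{\text{vir}} = 5\sigma_v^2 R / (GM)$ for a single core; the input values are placeholders for illustration, not measurements from this survey.

```python
import numpy as np

# Standard virial parameter: alpha_vir = 5 * sigma_v^2 * R / (G * M)
# Values below are placeholders, not measurements from the CMC survey.
G = 6.674e-11            # gravitational constant [m^3 kg^-1 s^-2]
M_sun = 1.989e30         # solar mass [kg]
pc = 3.086e16            # parsec [m]

sigma_v = 0.3e3          # 1D velocity dispersion [m/s]   (placeholder)
R = 0.1 * pc             # core radius                    (placeholder)
M = 10 * M_sun           # core mass                      (placeholder)

alpha_vir = 5 * sigma_v**2 * R / (G * M)
print(f"alpha_vir = {alpha_vir:.2f}")   # alpha_vir < 1 suggests a gravitationally bound core
```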
466 - Peng Cheng 2021
We present a paradox for evaporating black holes, which is common to most schemes that try to avoid the firewall by decoupling early and late radiation. At the late stage of black hole evaporation, the decoupling between early and late radiation cannot be realized because the black hole has a very small coarse-grained entropy, so we are faced with the firewall again. We call the problem the hair-loss paradox, as a pun on the black hole losing its soft hair during evaporation and on the pressure the information paradox has put on researchers.
91 - Ru Xu, Peng Chen, Jing Zhou 2021
GaN-based lateral Schottky barrier diodes (SBDs) have attracted great attention for high-power applications due to their combination of high electron mobility and large critical breakdown field. However, the breakdown voltage (BV) of such SBDs is at present far from exploiting the material advantages of GaN, limiting the use of GaN for ultra-high-voltage (UHV) applications. A key question is therefore whether the excellent properties of GaN-based materials can be practically exploited in the UHV regime. Here we demonstrate UHV AlGaN/GaN SBDs on sapphire with a BV of 10.6 kV and a specific on-resistance of 25.8 m$\Omega\cdot$cm$^2$, yielding a power figure of merit of more than 3.8 GW/cm$^2$. These devices are designed with a single channel and an 85-$\mu$m anode-to-cathode spacing, without any additional electric-field management, demonstrating great potential for UHV applications in power electronics.
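As a quick sanity check on the reported figure of merit, the snippet below uses the common lateral-device definition FOM = BV$^2$/R$_{\rm on,sp}$ (which may differ in detail from the authors' exact definition) with the headline numbers quoted above.

```python
# Power figure of merit for a lateral power device: FOM = BV^2 / R_on,sp
BV = 10.6e3            # breakdown voltage [V]
Ron_sp = 25.8e-3       # specific on-resistance [Ohm * cm^2]

fom = BV**2 / Ron_sp   # [W / cm^2]
print(f"FOM = {fom / 1e9:.1f} GW/cm^2")   # ~4.4 GW/cm^2, consistent with the reported > 3.8 GW/cm^2
```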
We present Mobile-Former, a parallel design of MobileNet and Transformer with a two-way bridge in between. This structure leverages the advantage of MobileNet at local processing and of the transformer at global interaction, and the bridge enables bidirectional fusion of local and global features. Different from recent works on vision transformers, the transformer in Mobile-Former contains very few tokens (e.g., fewer than 6) that are randomly initialized, resulting in low computational cost. Combined with the proposed lightweight cross attention that models the bridge, Mobile-Former is not only computationally efficient but also has more representation power, outperforming MobileNetV3 in the low-FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, it achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 while saving 17% of computation. When transferred to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP.
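Below is a minimal PyTorch sketch of the two-way bridge idea described above. It uses standard multi-head attention in both directions, whereas the paper proposes a lighter-weight cross attention, so treat this as an approximation of the structure rather than the actual model.

```python
import torch
import torch.nn as nn

class TwoWayBridge(nn.Module):
    """Toy Mobile-Former style bridge: a handful of learnable global tokens exchange
    information with the local feature map via cross attention in both directions.
    (A sketch under our own assumptions, not the released Mobile-Former code.)
    """
    def __init__(self, dim, num_tokens=6, num_heads=2):   # dim must be divisible by num_heads
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, dim))   # randomly initialized tokens
        self.mobile_to_former = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.former_to_mobile = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):                               # feats: (B, HW, C) flattened local features
        tokens = self.tokens.expand(feats.size(0), -1, -1)
        # Mobile -> Former: tokens query the local feature map (global summarization)
        tokens, _ = self.mobile_to_former(tokens, feats, feats)
        # Former -> Mobile: local features query the tokens (global context injection)
        feats, _ = self.former_to_mobile(feats, tokens, tokens)
        return feats, tokens
```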
This paper aims at addressing the problem of substantial performance degradation at extremely low computational cost (e.g., 5M FLOPs on ImageNet classification). We find that two factors, sparse connectivity and dynamic activation functions, are effective for improving accuracy. The former avoids a significant reduction of network width, while the latter mitigates the detriment of reduced network depth. Technically, we propose micro-factorized convolution, which factorizes a convolution matrix into low-rank matrices, to integrate sparse connectivity into convolution. We also present a new dynamic activation function, named Dynamic Shift-Max, that improves non-linearity by maxing out multiple dynamic fusions between an input feature map and its circular channel shift. Building upon these two new operators, we arrive at a family of networks, named MicroNet, that achieves significant performance gains over the state of the art in the low-FLOP regime. For instance, under the constraint of 12M FLOPs, MicroNet achieves 59.4% top-1 accuracy on ImageNet classification, outperforming MobileNetV3 by 9.6%. Source code is available at https://github.com/liyunsheng13/micronet.
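To illustrate the shift-and-max idea, here is a simplified activation in PyTorch. The real Dynamic Shift-Max makes the fusion weights input-dependent; this sketch keeps them static for brevity and is not the paper's exact operator.

```python
import torch
import torch.nn as nn

class ShiftMaxToy(nn.Module):
    """Simplified Shift-Max style activation (a sketch, not MicroNet's Dynamic Shift-Max).

    Each output is the element-wise max over a few fusions of the input with circular
    shifts of its channels; the fusion weights here are static rather than dynamic.
    """
    def __init__(self, channels, num_shifts=2):
        super().__init__()
        self.shifts = [i * channels // num_shifts for i in range(num_shifts)]
        self.weights = nn.Parameter(torch.ones(num_shifts, 2))   # per-branch fusion weights

    def forward(self, x):                                        # x: (B, C, H, W)
        branches = []
        for w, s in zip(self.weights, self.shifts):
            shifted = torch.roll(x, shifts=s, dims=1)            # circular channel shift
            branches.append(w[0] * x + w[1] * shifted)           # fuse input with its shift
        return torch.stack(branches, dim=0).max(dim=0).values    # max over the fusions
```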
The purpose of a session-based recommendation system is to predict the user's next click from the preceding session sequence. Current studies generally learn user preferences from the transitions of items in the user's session sequence. However, other effective information in the session sequence, such as user profiles, is largely ignored, which may leave the model unable to learn the user's specific preferences. In this paper, we propose a heterogeneous graph neural network-based session recommendation method, named SR-HetGNN, which learns session embeddings with a heterogeneous graph neural network (HetGNN) and captures the specific preferences of anonymous users. Specifically, SR-HetGNN first constructs heterogeneous graphs containing various types of nodes according to the session sequence, which can capture the dependencies among items, users, and sessions. Second, HetGNN captures the complex transitions between items and learns item embeddings containing user information. Finally, to consider the influence of users' long- and short-term preferences, local and global session embeddings are combined with an attention network to obtain the final session embedding. Extensive experiments over two large real-world datasets, Diginetica and Tmall, show that SR-HetGNN is superior to existing state-of-the-art session-based recommendation methods.
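A small PyTorch sketch of the final readout step described above, combining a local (last-click) embedding with an attention-weighted global embedding; the layer names and shapes are our assumptions, not SR-HetGNN's released code.

```python
import torch
import torch.nn as nn

class SessionReadout(nn.Module):
    """Combine local and global session embeddings with attention (a sketch of the
    readout described in the abstract, not SR-HetGNN's exact implementation)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(2 * dim, 1)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, item_embs):                  # item_embs: (B, T, D) item embeddings in a session
        local = item_embs[:, -1]                   # local (short-term) preference: last clicked item
        scores = self.attn(torch.cat(
            [item_embs, local.unsqueeze(1).expand_as(item_embs)], dim=-1))
        alpha = scores.softmax(dim=1)              # attention over the items in the session
        global_ = (alpha * item_embs).sum(dim=1)   # global (long-term) preference
        return self.out(torch.cat([local, global_], dim=-1))   # final session embedding
```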
This paper revisits human-object interaction (HOI) recognition at the image level without using supervision of object locations or human poses. We name it detection-free HOI recognition, in contrast to existing detection-supervised approaches, which rely on object and keypoint detections to achieve state of the art. With our method, not only is detection supervision dispensable, but superior performance can be achieved by properly using image-text pre-training (such as CLIP) and the proposed Log-Sum-Exp Sign (LSE-Sign) loss function. Specifically, using text embeddings of class labels to initialize the linear classifier is essential for leveraging the CLIP pre-trained image encoder. In addition, the LSE-Sign loss facilitates learning from multiple labels on an imbalanced dataset by normalizing gradients over all classes in a softmax format. Surprisingly, our detection-free solution achieves 60.5 mAP on the HICO dataset, outperforming the detection-supervised state of the art by 13.4 mAP.
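As a sketch of the classifier initialization described above, using the open-source CLIP package; the prompt template and class names are placeholders, and the paper's exact setup may differ.

```python
import clip          # pip install git+https://github.com/openai/CLIP.git
import torch

# Initialize a linear classifier from CLIP text embeddings of the class labels.
# The class names and prompt template below are placeholders, not the HICO label set.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["riding a horse", "holding a cup", "washing a car"]
with torch.no_grad():
    text = clip.tokenize([f"a photo of a person {c}" for c in class_names]).to(device)
    text_feats = model.encode_text(text).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)   # L2-normalize

classifier = torch.nn.Linear(text_feats.shape[-1], len(class_names), bias=False)
classifier.weight.data.copy_(text_feats)   # text embeddings become the classifier weights
```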