141 - Chi Hu, Bei Li, Ye Lin 2021
This paper describes the submissions of the NiuTrans Team to the WNGT 2020 Efficiency Shared Task. We focus on the efficient implementation of deep Transformer models \cite{wang-etal-2019-learning, li-etal-2019-niutrans} using NiuTensor (https://github.com/NiuTrans/NiuTensor), a flexible toolkit for NLP tasks. We explore the combination of a deep encoder and a shallow decoder in Transformer models via model compression and knowledge distillation. Neural machine translation decoding also benefits from FP16 inference, attention caching, dynamic batching, and batch pruning. Our systems achieve promising results in both translation quality and efficiency, e.g., our fastest system can translate more than 40,000 tokens per second on an RTX 2080 Ti while maintaining 42.9 BLEU on \textit{newstest2018}. The code, models, and docker images are available at NiuTrans.NMT (https://github.com/NiuTrans/NiuTrans.NMT).
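As a rough illustration of the "deep encoder, shallow decoder" idea and FP16 inference mentioned above, here is a PyTorch sketch; it is not the NiuTensor implementation, and the layer counts and dimensions are assumptions for illustration only.

```python
# Illustrative PyTorch sketch (not the NiuTensor implementation): a deep-encoder,
# shallow-decoder Transformer run in FP16 for fast inference.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # FP16 only on GPU

model = torch.nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=35, num_decoder_layers=1,  # deep encoder, shallow decoder
    dim_feedforward=2048, batch_first=True,
).to(device=device, dtype=dtype).eval()

# Dummy, already-embedded source and target sequences: (batch, length, d_model).
src = torch.randn(8, 40, 512, device=device, dtype=dtype)
tgt = torch.randn(8, 20, 512, device=device, dtype=dtype)

with torch.no_grad():
    out = model(src, tgt)
print(out.shape)  # torch.Size([8, 20, 512])
```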
129 - Hongwei Xue, Bei Liu, Huan Yang 2021
In this paper we focus on landscape animation, which aims to generate time-lapse videos from a single landscape image. Motion is crucial for landscape animation as it determines how objects move in videos. Existing methods are able to generate appealing videos by learning motion from real time-lapse videos. However, current methods suffer from inaccurate motion generation, which leads to unrealistic video results. To tackle this problem, we propose a model named FGLA to generate high-quality and realistic videos by learning Fine-Grained motion embedding for Landscape Animation. Our model consists of two parts: (1) a motion encoder, which embeds time-lapse motion in a fine-grained way, and (2) a motion generator, which generates realistic motion to animate input images. To train and evaluate on diverse time-lapse videos, we build the largest high-resolution Time-lapse video dataset with Diverse scenes, namely Time-lapse-D, which includes 16,874 video clips with over 10 million frames. Quantitative and qualitative experimental results demonstrate the superiority of our method. In particular, our method achieves relative improvements of 19% on LPIPS and 5.6% on FVD compared with state-of-the-art methods on our dataset. A user study carried out with 700 human subjects shows that our approach visually outperforms existing methods by a large margin.
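The basic operation behind animating a still frame with a generated motion field can be sketched as a dense warp. The following is a generic illustration, not the FGLA architecture; the function name and flow values are made up.

```python
# Generic sketch (not the FGLA architecture): warp a still frame with a dense
# motion field, the core step of animating a landscape photo.
import torch
import torch.nn.functional as F

def warp(image, flow):
    """image: (B, C, H, W); flow: (B, 2, H, W) pixel offsets (dx, dy)."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    grid = base + flow                                        # absolute sample coords
    gx = 2.0 * grid[:, 0] / (w - 1) - 1.0                     # normalize x to [-1, 1]
    gy = 2.0 * grid[:, 1] / (h - 1) - 1.0                     # normalize y to [-1, 1]
    return F.grid_sample(image, torch.stack((gx, gy), dim=-1),
                         mode="bilinear", align_corners=True)

frame0 = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)
flow[:, 0] = 1.5                      # e.g. drift clouds 1.5 pixels to the right
frame1 = warp(frame0, flow)
print(frame1.shape)                   # torch.Size([1, 3, 64, 64])
```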
An atom placed in an optical vortex close to the axis may, upon absorbing a photon, acquire a transverse momentum much larger than the transverse momentum of any plane-wave component of the vortex light field. Although this surprising phenomenon, dubbed the superkick, has been clarified in terms of the atom wave packet evolution in the field of an optical vortex, a fully quantum treatment is still lacking. Here, we study the effect within the quantum field theoretical framework. We consider the collision of a Bessel twisted wave with a compact Gaussian beam focused to a small focal spot $\sigma$ located at distance $b$ from the twisted beam axis. Through a qualitative discussion supported by exact analytical and numerical calculations, we recover the superkick phenomenon for $\sigma \ll b$ and explore its limits when $\sigma$ becomes comparable to $b$. We also comment on the recent suggestions to observe this effect in nuclear or high-energy physics.
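For orientation, the standard heuristic behind the superkick can be stated compactly; this is the textbook estimate, not the paper's full quantum-field-theoretical treatment.

```latex
% Heuristic estimate behind the superkick: an absorber at distance b from the
% axis of a vortex with winding number \ell picks up the local transverse
% momentum set by the phase gradient,
\[
  p_\perp(b) \;\simeq\; \frac{\hbar \ell}{b},
\]
% which exceeds the transverse momentum \hbar\varkappa of any plane-wave
% component of the Bessel beam whenever b \ll \ell/\varkappa.
```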
The scarcity of high-quality medical image annotations hinders the implementation of accurate clinical applications for detecting and segmenting abnormal lesions. To mitigate this issue, the scientific community is working on the development of unsupervised anomaly detection (UAD) systems that learn from a training set containing only normal (i.e., healthy) images, where abnormal samples (i.e., unhealthy) are detected and segmented based on how much they deviate from the learned distribution of normal samples. One significant challenge faced by UAD methods is how to learn effective low-dimensional image representations that are sensitive enough to detect and segment abnormal lesions of varying size, appearance, and shape. To address this challenge, we propose a novel self-supervised UAD pre-training algorithm, named Multi-centred Strong Augmentation via Contrastive Learning (MSACL). MSACL learns representations by separating several types of strong and weak augmentations of normal image samples, where the weak augmentations represent normal images and strong augmentations denote synthetic abnormal images. To produce such strong augmentations, we introduce MedMix, a novel data augmentation strategy that creates new training images with realistic-looking lesions (i.e., anomalies) in normal images. The pre-trained representations from MSACL are generic and can be used to improve the efficacy of different types of off-the-shelf state-of-the-art (SOTA) UAD models. Comprehensive experimental results show that the use of MSACL largely improves these SOTA UAD models on four medical imaging datasets from diverse organs, namely colonoscopy, fundus screening, and COVID-19 chest X-ray datasets.
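A minimal sketch of the weak-versus-strong augmentation contrast described above follows; it is a generic InfoNCE-style loss, not the exact MSACL objective, and the encoder outputs, batch size, and temperature are assumptions.

```python
# Sketch of the idea (not the exact MSACL objective): two weak augmentations of a
# normal image act as a positive pair, a strong (pseudo-anomalous) augmentation
# of the same image acts as a negative.
import torch
import torch.nn.functional as F

def weak_strong_contrast(z_weak1, z_weak2, z_strong, tau=0.1):
    """All inputs: (B, D) embeddings of the same batch under different augmentations."""
    z_weak1, z_weak2, z_strong = (F.normalize(z, dim=1)
                                  for z in (z_weak1, z_weak2, z_strong))
    pos = (z_weak1 * z_weak2).sum(dim=1) / tau   # similarity between the weak views
    neg = (z_weak1 * z_strong).sum(dim=1) / tau  # similarity to the strong view
    logits = torch.stack([pos, neg], dim=1)      # one positive, one negative per anchor
    labels = torch.zeros(z_weak1.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

b, d = 16, 128
loss = weak_strong_contrast(torch.randn(b, d), torch.randn(b, d), torch.randn(b, d))
print(loss.item())
```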
The defect detection task can be regarded as a realistic scenario of object detection in the computer vision field, and it is widely used in industry. Directly applying a vanilla object detector to the defect detection task can achieve promising results, but there remain challenging issues that have not been solved. The first issue is texture shift, which means a trained defect detector is easily affected by unseen textures, and the second issue is partial visual confusion, which indicates that a partial defect box is visually similar to a complete box. To tackle these two problems, we propose a Reference-based Defect Detection Network (RDDN). Specifically, we introduce a template reference and a context reference to address these two problems, respectively. The template reference can reduce the texture shift at the image, feature, or region level, and as a result encourages the detector to focus more on the defective area. We can use either well-aligned template images or the outputs of a pseudo template generator as template references in this work, and they are jointly trained with the detector under the supervision of normal samples. To solve the partial visual confusion issue, we propose to leverage the context information carried by the context reference, which is the concentric, enlarged box of each region proposal, to perform more accurate region classification and regression. Experiments on two defect detection datasets demonstrate the effectiveness of our proposed approach.
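Since the context reference is described as the concentric, enlarged box of each region proposal, a small hypothetical helper makes that construction concrete; this is not the RDDN code, and the scale factor is an assumption.

```python
# Hypothetical helper (not the RDDN code): compute the concentric, enlarged
# "context reference" box for each region proposal.
import numpy as np

def context_boxes(proposals, scale=2.0):
    """proposals: (N, 4) boxes as (x1, y1, x2, y2); returns concentric scaled-up boxes."""
    cx = (proposals[:, 0] + proposals[:, 2]) / 2.0
    cy = (proposals[:, 1] + proposals[:, 3]) / 2.0
    w = (proposals[:, 2] - proposals[:, 0]) * scale
    h = (proposals[:, 3] - proposals[:, 1]) * scale
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

props = np.array([[10.0, 10.0, 30.0, 40.0]])
print(context_boxes(props))  # [[ 0. -5. 40. 55.]]
```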
Collider experiments often exploit information about the quantum numbers of final-state hadrons to maximize their sensitivity, with applications ranging from the use of tracking information (electric charge) for precision jet substructure measurements, to flavor tagging for nucleon structure studies. For such measurements, perturbative calculations in terms of quarks and gluons are insufficient, and non-perturbative track functions, describing the energy fraction of a quark or gluon converted into a subset of hadrons (e.g. charged hadrons), must be incorporated. Unlike fragmentation functions, track functions describe correlations between hadrons, and therefore satisfy complicated non-linear evolution equations whose structure has so far eluded calculation beyond the leading order. In this Letter we develop an understanding of track functions, and their interplay with energy flow observables, beyond the leading order, allowing them to be used in state-of-the-art perturbative calculations for the first time. We identify a shift symmetry in the evolution of their moments that fixes their structure, and we explicitly compute the evolution of the first three moments at next-to-leading order, allowing for the description of up to three-point energy correlations. We then calculate the two-point energy correlator on charged particles at $O(\alpha_s^2)$, illustrating explicitly that infrared singularities in perturbation theory are absorbed by moments of the track functions, and also highlighting how these moments seamlessly interplay with modern techniques for perturbative calculations. Our results extend the boundaries of traditional perturbative QCD, enabling precision perturbative predictions for energy flow observables sensitive to the quantum numbers of hadronic states.
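For orientation, the quantities involved can be stated in their standard form; the leading-order relation below is only a schematic matching and is not the paper's next-to-leading-order result.

```latex
% Standard definitions: the track function T_i(x,\mu) is the probability density
% that a parton of flavor i converts the energy fraction x into the tagged subset
% of hadrons (e.g. charged hadrons); its moments are
\[
  T_i^{(n)}(\mu) = \int_0^1 \mathrm{d}x \, x^{\,n}\, T_i(x,\mu),
  \qquad
  \int_0^1 \mathrm{d}x \, T_i(x,\mu) = 1 .
\]
% Schematically, at leading order the two-point energy correlator measured on
% charged hadrons weights each parton by a first moment of its track function:
\[
  \frac{\mathrm{d}\sigma^{\text{tracks}}}{\mathrm{d}z}
  \;\overset{\text{LO}}{=}\;
  \sum_{i,j} T_i^{(1)}\, T_j^{(1)}\, \frac{\mathrm{d}\sigma_{ij}}{\mathrm{d}z} .
\]
```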
The standard chirplet transform (CT) with a chirp-modulated Gaussian window provides a valuable tool for analyzing linear chirp signals. The parameters in the window determine the performance of the CT and play a very important role in high-resolution time-frequency (TF) analysis. In this paper, we first analyze the window shape of the CT and compare it with an extension that employs a Gaussian window rotated by the fractional Fourier transform. This parameter analysis provides theoretical guidance for developing a high-resolution CT. We then propose a multi-resolution chirplet transform (MrCT) that combines multiple CTs with different parameter combinations. These are combined geometrically to obtain an improved TF resolution by overcoming the limitations of any single CT representation. By deriving the combined instantaneous frequency equation, we further develop a high-concentration TF post-processing approach to improve the readability of the MrCT. Numerical experiments on simulated and real signals verify its effectiveness.
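One common convention for the chirplet transform with a chirp-modulated Gaussian window is given below, which makes the window parameters explicit; this is a textbook form and not necessarily the exact normalization used in the paper.

```latex
% Chirplet transform of a signal f with a Gaussian window of width \sigma and
% chirp rate \alpha:
\[
  \mathrm{CT}_f(t_0,\omega,\alpha;\sigma)
  = \int_{\mathbb{R}} f(t)\,
    \frac{1}{\sqrt{2\pi}\,\sigma}\,
    e^{-\frac{(t-t_0)^2}{2\sigma^2}}\,
    e^{-\mathrm{i}\left[\omega (t-t_0) + \frac{\alpha}{2}(t-t_0)^2\right]}
    \,\mathrm{d}t ,
\]
% so \sigma sets the time-frequency resolution trade-off and \alpha tilts the
% analysis window along linear-chirp directions in the TF plane.
```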
87 - Beibei Liu 2021
In this paper, we construct Kleinian groups $\Gamma < \mathrm{Isom}(\mathbb{H}^{2n})$ from the direct product of $n$ copies of the rank-2 free group $F_2$ via strict hyperbolization. We give a description of the limit set and its topological dimension. Such a construction can be generalized to other right-angled Artin groups.
Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN and then aligns images and text with a Transformer. Visual relations between visual contents play an important role in image understanding and are the basis for inter-modal alignment learning. However, CNNs have limitations in visual relation learning because local receptive fields are weak at modeling long-range dependencies. As a result, the two objectives of learning visual relations and inter-modal alignment are encapsulated in the same Transformer network. Such a design might restrict the inter-modal alignment learning in the Transformer by ignoring the specialized characteristics of each objective. To tackle this, we propose a fully Transformer-based visual embedding for VLP to better learn visual relations and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between the vision and language modalities (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in the Transformer to further promote inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of the Transformer for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Image-Text Retrieval, Visual Question Answering (VQA), Visual Entailment, and Visual Reasoning. Our approach not only outperforms state-of-the-art VLP performance, but also shows benefits on the IMF metric.
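A generic illustration of masked feature regression on visual tokens follows; it is a sketch under the assumption of a standard Transformer encoder, not the paper's exact MFR design, and the model size and masking ratio are made up.

```python
# Generic masked-feature-regression sketch (not the paper's exact MFR design):
# mask a subset of visual token features and regress the originals from context.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(256, 256)            # regresses the original masked features

tokens = torch.randn(2, 50, 256)      # visual token features from an image
mask = torch.rand(2, 50) < 0.15       # mask ~15% of the tokens
masked = tokens.clone()
masked[mask] = 0.0                    # zero out the masked features

out = encoder(masked)
loss = F.mse_loss(head(out)[mask], tokens[mask])  # regress masked features from context
print(loss.item())
```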
Minimally invasive surgery (MIS) has many documented advantages, but the surgeon's limited visual contact with the scene can be problematic. Hence, systems that help surgeons navigate, such as a method that can produce a 3D semantic map, can compensate for this limitation. In theory, we can borrow 3D semantic mapping techniques developed for robotics, but this requires solving the following challenges in MIS: 1) semantic segmentation, 2) depth estimation, and 3) pose estimation. In this paper, we propose the first 3D semantic mapping system for knee arthroscopy that solves the three challenges above. Using out-of-distribution non-human datasets, where pose could be labeled, we jointly train depth and pose estimators using self-supervised and supervised losses. Using an in-distribution human knee dataset, we train a fully supervised semantic segmentation system to label arthroscopic image pixels as femur, ACL, or meniscus. Taking test images from human knees, we combine the results from these two systems to automatically create 3D semantic maps of the human knee. This work opens the pathway to the generation of intraoperative 3D semantic maps, registration with pre-operative data, and robot-assisted arthroscopy.
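The final fusion step, combining per-pixel depth, semantic labels, and a camera pose into a labelled 3D map, can be sketched as a standard back-projection; this is a generic illustration, not the paper's pipeline, and the intrinsics, depth values, and label ids are placeholders.

```python
# Generic back-projection sketch (not the paper's pipeline): fuse a depth map,
# a semantic label map, and a camera pose into a labelled 3D point cloud.
import numpy as np

def semantic_point_cloud(depth, labels, K, T_world_cam):
    """depth: (H, W) metres; labels: (H, W) class ids; K: (3, 3) intrinsics;
    T_world_cam: (4, 4) camera-to-world pose. Returns (N, 3) points and (N,) labels."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    rays = np.linalg.inv(K) @ pix                                      # camera-frame rays
    pts_cam = rays * depth.reshape(1, -1)                              # scale by depth
    pts_h = np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])       # homogeneous coords
    pts_world = (T_world_cam @ pts_h)[:3].T                            # (H*W, 3)
    return pts_world, labels.reshape(-1)

K = np.array([[500.0, 0.0, 160.0], [0.0, 500.0, 120.0], [0.0, 0.0, 1.0]])
depth = np.full((240, 320), 0.05)              # ~5 cm working distance (placeholder)
labels = np.random.randint(0, 3, (240, 320))   # e.g. femur / ACL / meniscus ids
pts, lab = semantic_point_cloud(depth, labels, K, np.eye(4))
print(pts.shape, lab.shape)                    # (76800, 3) (76800,)
```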