
This report describes the submission of the DKU-DukeECE-Lenovo team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2021 track 4. Our system includes a voice activity detection (VAD) model, a speaker embedding model, two clustering-based speaker diarization systems with different similarity measurements, two different overlapped speech detection (OSD) models, and a target-speaker voice activity detection (TS-VAD) model. Our final submission, consisting of 5 independent systems, achieves a DER of 5.07% on the challenge test set.
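For context, the diarization error rate (DER) reported above is the fraction of reference speech time that is missed, falsely detected, or attributed to the wrong speaker. A minimal sketch of the metric (not the challenge's official scorer, which handles collars and overlap from RTTM files):

```python
# Illustrative DER computation from pre-accumulated error durations.
# The official scoring uses dedicated tools (e.g. md-eval); this only
# shows how the headline number is composed.

def der(missed: float, false_alarm: float, confusion: float,
        total_speech: float) -> float:
    """Diarization error rate as a fraction of total reference speech time."""
    return (missed + false_alarm + confusion) / total_speech

# e.g. 1.2 s missed, 0.8 s false alarm, 0.5 s confusion over 50 s of speech
print(der(1.2, 0.8, 0.5, 50.0))  # 0.05, i.e. 5% DER
```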
Transformers have shown impressive performance in various natural language processing and computer vision tasks, owing to their capability of modeling long-range dependencies. Recent progress has demonstrated that combining such transformers with CNN-based semantic image segmentation models is very promising. However, how well a pure transformer-based approach can perform for image segmentation has not yet been well studied. In this work, we explore a novel framework for semantic image segmentation: encoder-decoder based Fully Transformer Networks (FTN). Specifically, we first propose a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features, while reducing the computation complexity of the standard vision transformer (ViT). Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation. Surprisingly, this simple baseline achieves new state-of-the-art results on multiple challenging semantic segmentation benchmarks, including PASCAL Context, ADE20K and COCO-Stuff. The source code will be released upon the publication of this work.
In this paper, we present the submitted system for the third DIHARD Speech Diarization Challenge from the DKU-Duke-Lenovo team. Our system consists of several modules: voice activity detection (VAD), segmentation, speaker embedding extraction, attentive similarity scoring, and agglomerative hierarchical clustering. In addition, target-speaker VAD (TSVAD) is used for the phone call data to further improve performance. Our final submitted system achieves a DER of 15.43% on the core evaluation set and 13.39% on the full evaluation set for task 1, and a DER of 21.63% on the core evaluation set and 18.90% on the full evaluation set for task 2.
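The agglomerative hierarchical clustering (AHC) step named above groups speaker embeddings by repeatedly merging the closest clusters until an inter-cluster distance threshold is reached. A toy average-linkage sketch with cosine distance (the actual system uses learned embeddings, attentive scoring, and a tuned threshold; the 2-D "embeddings" and threshold here are purely illustrative):

```python
import math

def cosine_dist(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def ahc(embeddings, threshold):
    """Average-linkage agglomerative clustering of segment indices,
    stopping once the closest pair of clusters exceeds the threshold."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(cosine_dist(embeddings[a], embeddings[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are distinct speakers
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy segment "embeddings": two speakers along different directions
embs = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.05),
        (0.0, 1.0), (0.1, 0.9)]
print(ahc(embs, threshold=0.5))  # segments 0-2 vs. 3-4
```

The stopping threshold plays the same role as the tuned AHC threshold in diarization systems: lower values split more aggressively into more speakers.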
In recent years, Convolutional Neural Network (CNN) based trackers have achieved state-of-the-art performance on multiple benchmark datasets. Most of these trackers train a binary classifier to distinguish the target from its background. However, they suffer from two limitations. Firstly, these trackers cannot effectively handle significant appearance variations due to the limited number of positive samples. Secondly, there exists a significant imbalance of gradient contributions between easy and hard samples, where the easy samples usually dominate the computation of the gradient. In this paper, we propose a robust tracking method via Statistical Positive sample generation and Gradient Aware learning (SPGA) to address the above two limitations. To enrich the diversity of positive samples, we present an effective and efficient statistical positive sample generation algorithm to generate positive samples in the feature space. Furthermore, to handle the imbalance between easy and hard samples, we propose a gradient sensitive loss to harmonize the gradient contributions between easy and hard samples. Extensive experiments on three challenging benchmark datasets including OTB50, OTB100 and VOT2016 demonstrate that the proposed SPGA performs favorably against several state-of-the-art trackers.
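The abstract does not give the exact form of the gradient sensitive loss. A common way to suppress the gradient contribution of easy samples, in the spirit of focal-loss-style modulation (illustrative only, not the paper's formulation):

```python
import math

def modulated_bce(p: float, y: int, gamma: float = 2.0) -> float:
    """Binary cross-entropy with a modulating factor (1 - pt)^gamma that
    down-weights easy, well-classified samples so hard samples dominate
    the gradient. Illustrative; the paper's exact loss differs.
    p: predicted probability of the positive class, y: label in {0, 1}."""
    pt = p if y == 1 else 1.0 - p          # probability of the true class
    return -((1.0 - pt) ** gamma) * math.log(max(pt, 1e-12))

# An easy positive (p = 0.95) contributes far less than a hard one (p = 0.3)
print(modulated_bce(0.95, 1) < modulated_bce(0.3, 1))  # True
```

With gamma = 0 this reduces to plain cross-entropy, so gamma directly controls how strongly easy samples are suppressed.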
Video object detection is a challenging task due to the deteriorated quality of video sequences captured under complex environments. Currently, this area is dominated by a series of feature enhancement based methods, which distill beneficial semantic information from multiple frames and generate enhanced features through fusing the distilled information. However, the distillation and fusion operations are usually performed at either frame level or instance level with external guidance using additional information, such as optical flow and feature memory. In this work, we propose a dual semantic fusion network (abbreviated as DSFNet) to fully exploit both frame-level and instance-level semantics in a unified fusion framework without external guidance. Moreover, we introduce a geometric similarity measure into the fusion process to alleviate the influence of information distortion caused by noise. As a result, the proposed DSFNet can generate more robust features through the multi-granularity fusion and avoid being affected by the instability of external guidance. To evaluate the proposed DSFNet, we conduct extensive experiments on the ImageNet VID dataset. Notably, the proposed dual semantic fusion network achieves, to the best of our knowledge, the best performance of 84.1% mAP among the current state-of-the-art video object detectors with ResNet-101 and 85.4% mAP with ResNeXt-101, without using any post-processing steps.
Jian Lin, Jue Nan, Yuchen Luo (2019)
Quantum simulations of Fermi-Hubbard models have attracted considerable effort in optical lattice research, with the ultracold anti-ferromagnetic atomic phase reached at half filling in recent years. An unresolved issue is how to dope the system while maintaining low thermal entropy. Here we propose to reach the low-temperature phase of the doped Fermi-Hubbard model using incommensurate optical lattices through adiabatic quantum evolution. In this theoretical proposal, we find that a major obstacle to adiabatic doping is atomic localization in the incommensurate lattice, which can cause exponential slowing down of the adiabatic procedure. We study both one- and two-dimensional incommensurate optical lattices, and find that localization prevents efficient adiabatic doping in the strong lattice regime in both cases. With density matrix renormalization group calculations, we further show that the slowing-down problem in one dimension can be circumvented by exploiting interaction-induced many-body delocalization, which is experimentally feasible using Feshbach resonance techniques. This protocol is expected to be efficient in two dimensions as well, where the localization phenomenon is less stable.
Climbing soft robots are of tremendous interest in both science and engineering due to their potential applications in intelligent surveillance, inspection, maintenance, and detection in environments away from the ground. The challenge lies in the design of a fast, robust, switchable adhesion actuator that can easily attach to and detach from vertical surfaces. Here, we propose a new design of a pneumatic-actuated, bioinspired soft adhesion actuator that works both on ground and under water. It is composed of extremely soft bilayer structures with an embedded spiral pneumatic channel resting on top of a base layer with a cavity. Rather than the traditional approach of directly pumping air out of the cavity for suction, as in hard polymer-based adhesion actuators, we inflate air into the top spiral channel so that it deforms into a stable 3D domed shape, achieving negative pressure in the cavity. Characterization of the maximum shear adhesion force of the proposed soft adhesion actuator shows strong and rapidly reversible adhesion on multiple types of smooth and semi-smooth surfaces. Based on the switchable adhesion actuator, we design and fabricate a novel load-carrying amphibious climbing soft robot (ACSR) by combining it with a soft bending actuator. We demonstrate that it can operate on a wide range of horizontal and vertical surfaces, including dry, wet, slippery, smooth, and semi-smooth ones, on ground and also under water, with certain load-carrying capability. We show that the vertical climbing speed can reach about 286 mm/min (1.6 body lengths/min) while carrying an object of over 200 g (over 5 times the weight of the ACSR itself) during climbing on ground and under water. This research could largely push the boundaries of soft robot capabilities and multifunctionality in window cleaning and underwater inspection under harsh environments.
Plastic scintillation detectors for time-of-flight (TOF) measurements are almost essential for event-by-event identification of relativistic rare isotopes. In this work, a pair of plastic scintillation detectors of $50 \times 50 \times 3^{t}$ mm$^3$ and $80 \times 100 \times 3^{t}$ mm$^3$ have been set up at the External Target Facility (ETF), Institute of Modern Physics. Their time, energy and position responses are measured with an $^{18}$O primary beam at 400 MeV/nucleon. After off-line walk-effect and position corrections, the time resolutions of the two detectors are determined to be 27 ps ($\sigma$) and 36 ps ($\sigma$), respectively. Both detectors have nearly the same energy resolution of 3% ($\sigma$) and position resolution of 2 mm ($\sigma$). The detectors have been used successfully in nuclear reaction cross section measurements, and will be employed in the upgrade of the RIBLL2 beam line at IMP as well as for the high energy branch at HIAF.
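The walk-effect correction mentioned above compensates for the fact that, with a fixed discriminator threshold, larger pulses cross the threshold earlier than smaller ones, shifting the measured time in an amplitude-dependent way. A minimal sketch of one common empirical form; the $1/\sqrt{Q}$ dependence and the constants here are illustrative assumptions, not the paper's fitted calibration:

```python
import math

def walk_corrected_time(t_meas: float, charge: float,
                        a: float, b: float) -> float:
    """Empirical time-walk correction: subtract an amplitude-dependent
    shift a/sqrt(Q) plus a constant offset b from the measured time.
    The functional form and constants are illustrative, fitted per
    detector in practice."""
    return t_meas - a / math.sqrt(charge) - b

# Synthetic check: times distorted by walk collapse back to a common value
a, b = 0.8, 0.1          # illustrative calibration constants
t_true = 5.0
for q in (10.0, 100.0, 1000.0):
    t_meas = t_true + a / math.sqrt(q) + b   # simulated walk-shifted time
    print(round(walk_corrected_time(t_meas, q, a, b), 6))  # 5.0 each time
```

In practice the constants are obtained by fitting measured time versus pulse charge; the quality of this fit is what allows the ~30 ps resolutions quoted above after correction.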