MLP-based architectures, which consist of a sequence of consecutive multi-layer perceptron blocks, have recently been found to achieve results comparable to convolutional and transformer-based methods. However, most adopt spatial MLPs that take fixed-dimension inputs, which makes it difficult to apply them to downstream tasks such as object detection and semantic segmentation. Moreover, single-stage designs further limit performance in other computer vision tasks, and fully connected layers are computationally heavy. To tackle these problems, we propose ConvMLP: a hierarchical Convolutional MLP for visual recognition, which is a light-weight, stage-wise co-design of convolution layers and MLPs. In particular, ConvMLP-S achieves 76.8% top-1 accuracy on ImageNet-1k with 9M parameters and 2.4G MACs (15% and 19% of MLP-Mixer-B/16, respectively). Experiments on object detection and semantic segmentation further show that the visual representation learned by ConvMLP can be seamlessly transferred and achieves competitive results with fewer parameters. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Convolutional-MLPs.
Jiachen Li, Fan Yang, Hengbo Ma (2021)
Motion forecasting, which aims to predict future motion sequences given a set of historical observations, plays a significant role in various domains (e.g., autonomous driving, human-robot interaction). However, the observed elements may be of different levels of importance. Some information may be irrelevant or even distracting to the forecasting in certain situations. To address this issue, we propose a generic motion forecasting framework (named RAIN) with dynamic key information selection and ranking based on a hybrid attention mechanism. The general framework is instantiated to handle multi-agent trajectory prediction and human motion forecasting tasks, respectively. In the former task, the model learns to recognize the relations between agents with a graph representation and to determine their relative significance. In the latter task, the model learns to capture the temporal proximity and dependency in long-term human motions. We also propose an effective double-stage training pipeline with an alternating training strategy to optimize the parameters in different modules of the framework. We validate the framework on both synthetic simulations and motion forecasting benchmarks in different domains, demonstrating that our method not only achieves state-of-the-art forecasting performance, but also provides interpretable and reasonable hybrid attention weights.
Defu Cao, Jiachen Li, Hengbo Ma (2021)
An effective understanding of the contextual environment and accurate motion forecasting of surrounding agents is crucial for the development of autonomous vehicles and social mobile robots. This task is challenging since the behavior of an autonomous agent is not only affected by its own intention, but also by the static environment and surrounding dynamically interacting agents. Previous works focused on utilizing the spatial and temporal information in the time domain while not sufficiently taking advantage of the cues in the frequency domain. To this end, we propose a Spectral Temporal Graph Neural Network (SpecTGNN), which can capture inter-agent correlations and temporal dependency simultaneously in the frequency domain in addition to the time domain. SpecTGNN operates in two streams, on both an agent graph with dynamic state information and an environment graph with the features extracted from context images. The model integrates graph Fourier transform, spectral graph convolution and temporal gated convolution to encode history information and forecast future trajectories. Moreover, we incorporate a multi-head spatio-temporal attention mechanism to mitigate the effect of error propagation over a long time horizon. We demonstrate the performance of SpecTGNN on two public trajectory prediction benchmark datasets, where it achieves state-of-the-art performance in terms of prediction accuracy.
We combine the renormalized singles (RS) Green's function with the T-Matrix approximation for the single-particle Green's function to compute quasiparticle energies for valence and core states of molecular systems. The $G_{\text{RS}}T_0$ method uses the RS Green's function that incorporates singles contributions as the initial Green's function. The $G_{\text{RS}}T_{\text{RS}}$ method further calculates the generalized effective interaction with the RS Green's function by using RS eigenvalues in the T-Matrix calculation through the particle-particle random phase approximation. The $G_{\text{RS}}T_{\text{RS}}$ method provides significant improvements over the one-shot T-Matrix method $G_0T_0$, as demonstrated in calculations for the GW100 and CORE65 test sets. It also systematically eliminates the dependence of $G_{0}T_{0}$ on the choice of density functional approximations (DFAs). For valence states, the $G_{\text{RS}}T_{\text{RS}}$ method provides excellent accuracy, which is better than $G_0T_0$ with Hartree-Fock (HF) or other DFAs. For core states, the $G_{\text{RS}}T_{\text{RS}}$ method correctly identifies desired peaks in the spectral function and significantly outperforms $G_0T_0$ on core level binding energies (CLBEs) and relative CLBEs, with any commonly used DFAs.
Segmentation-based scene text detection methods have been widely adopted for arbitrary-shaped text detection recently, since they make accurate pixel-level predictions on curved text instances and can facilitate real-time inference without time-consuming processing on anchors. However, current segmentation-based models are unable to learn the shapes of curved texts and often require complex label assignments or repeated feature aggregations for more accurate detection. In this paper, we propose RSCA: a Real-time Segmentation-based Context-Aware model for arbitrary-shaped scene text detection, which sets a strong baseline for scene text detection with two simple yet effective strategies: Local Context-Aware Upsampling and Dynamic Text-Spine Labeling, which model local spatial transformation and simplify label assignments, respectively. Based on these strategies, RSCA achieves state-of-the-art performance in both speed and accuracy, without complex label assignments or repeated feature aggregations. We conduct extensive experiments on multiple benchmarks to validate the effectiveness of our method. RSCA-640 reaches 83.9% F-measure at 48.3 FPS on the CTW1500 dataset.
Current anchor-free object detectors are quite simple and effective yet lack accurate label assignment methods, which limits their potential in competing with classic anchor-based models that are supported by well-designed assignment methods based on the Intersection-over-Union (IoU) metric. In this paper, we present Pseudo-Intersection-over-Union (Pseudo-IoU): a simple metric that brings a more standardized and accurate assignment rule into anchor-free object detection frameworks without any additional computational cost or extra parameters for training and testing. This makes it possible to further improve anchor-free object detection by utilizing training samples of good quality under effective assignment rules that have previously been applied in anchor-based methods. By incorporating the Pseudo-IoU metric into an end-to-end single-stage anchor-free object detection framework, we observe consistent improvements in its performance on general object detection benchmarks such as PASCAL VOC and MSCOCO. Our method (single-model and single-scale) also achieves comparable performance to other recent state-of-the-art anchor-free methods without bells and whistles. Our code is based on the mmdetection toolbox and will be made publicly available at https://github.com/SHI-Labs/Pseudo-IoU-for-Anchor-Free-Object-Detection.
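For readers unfamiliar with the metric that anchor-based assignment relies on, the following is a minimal sketch of the standard Intersection-over-Union between two axis-aligned boxes. It is not the Pseudo-IoU metric, whose definition the abstract does not give. The function name `box_iou` and the `(x1, y1, x2, y2)` corner format are assumptions made for illustration.

```python
def box_iou(box_a, box_b):
    """Standard IoU of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    The corner format is an assumption for this sketch.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped at zero
    # when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs yield an IoU of 0 by convention here.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1).
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

In anchor-based detectors, a candidate box is typically treated as a positive training sample when its IoU with a ground-truth box exceeds a chosen threshold; the threshold value itself varies by framework.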
We propose a measure, which we call the dissipative spectral form factor (DSFF), to characterize the spectral statistics of non-Hermitian (and non-Unitary) matrices. We show that DSFF successfully diagnoses dissipative quantum chaos, and reveals correlations between real and imaginary parts of the complex eigenvalues up to arbitrary energy (and time) scale. Specifically, we provide the exact solution of DSFF for the complex Ginibre ensemble (GinUE) and for a Poissonian random spectrum (Poisson) as minimal models of dissipative quantum chaotic and integrable systems respectively. For dissipative quantum chaotic systems, we show that DSFF exhibits an exact rotational symmetry in its complex time argument $\tau$. Analogous to the spectral form factor (SFF) behaviour for the Gaussian unitary ensemble, DSFF for GinUE shows a dip-ramp-plateau behavior in $|\tau|$: DSFF initially decreases, increases at intermediate time scales, and saturates after a generalized Heisenberg time which scales as the inverse mean level spacing. Remarkably, for large matrix size, the ramp of DSFF for GinUE increases quadratically in $|\tau|$, in contrast to the linear ramp in SFF for Hermitian ensembles. For dissipative quantum integrable systems, we show that DSFF takes a constant value except for a region in complex time whose size and behavior depend on the eigenvalue density. Numerically, we verify the above claims and additionally compute DSFF for real and quaternion real Ginibre ensembles. As a physical example, we consider the quantum kicked top model with dissipation, and show that it falls under the universality class of GinUE and Poisson as the 'kick' is switched on or off. Lastly, we study spectral statistics of ensembles of random classical stochastic matrices or Markov chains, and show that these models fall under the class of Ginibre ensemble.
An effective understanding of the environment and accurate trajectory prediction of surrounding dynamic obstacles are indispensable for intelligent mobile systems (e.g. autonomous vehicles and social robots) to achieve safe and high-quality planning when they navigate in highly interactive and crowded scenarios. Due to frequent interactions and uncertainty in the scene evolution, it is desirable for the prediction system to enable relational reasoning on different entities and provide a distribution of future trajectories for each agent. In this paper, we propose a generic generative neural system (called STG-DAT) for multi-agent trajectory prediction involving heterogeneous agents. The system takes a step forward to explicit interaction modeling by incorporating relational inductive biases with a dynamic graph representation and leverages both trajectory and scene context information. We also employ an efficient kinematic constraint layer applied to vehicle trajectory prediction. The constraint not only ensures physical feasibility but also enhances model performance. Moreover, the proposed prediction model can be easily adopted by multi-target tracking frameworks, and empirical results show that it improves tracking accuracy. The proposed system is evaluated on three public benchmark datasets for trajectory prediction, where the agents cover pedestrians, cyclists and on-road vehicles. The experimental results demonstrate that our model achieves better performance than various baseline approaches in terms of prediction and tracking accuracy.
Open-set speaker recognition can be regarded as a metric learning problem, whose goal is to maximize inter-class variance and minimize intra-class variance. Supervised metric learning can be categorized into entity-based learning and proxy-based learning. Most existing metric learning objectives, such as Contrastive, Triplet, Prototypical, and GE2E, belong to the former category, whose performance is either highly dependent on the sample mining strategy or restricted by insufficient label information in the mini-batch. Proxy-based losses mitigate both shortcomings; however, fine-grained connections among entities are either not leveraged or only indirectly leveraged. This paper proposes a Masked Proxy (MP) loss which directly incorporates both proxy-based relationships and pair-based relationships. We further propose a Multinomial Masked Proxy (MMP) loss to leverage the hardness of speaker pairs. These methods are evaluated on the VoxCeleb test set and reach state-of-the-art Equal Error Rate (EER).
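The Equal Error Rate reported above is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of one common way to estimate it from verification scores by sweeping a threshold over the observed scores. The function name `equal_error_rate`, the input format, and the choice to average the two rates at the closest crossing are assumptions for illustration, not the evaluation code used in the paper.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER from similarity scores (higher = more similar).

    target_scores:   scores of same-speaker trials.
    impostor_scores: scores of different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), 1.0

    for thr in thresholds:
        # False acceptance: impostor trials that pass the threshold.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        # False rejection: target trials that fail the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)

        # Keep the threshold where the two error rates are closest and
        # report their mean as the EER estimate.
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0

    return eer


if __name__ == "__main__":
    targets = [0.9, 0.8, 0.7, 0.4]      # illustrative scores, not real data
    impostors = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(targets, impostors))
```

With only a handful of trials the estimate is coarse; evaluation toolkits usually interpolate between thresholds on much larger trial lists.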
Automatic speaker verification (ASV) systems utilize the biometric information in human speech to verify the speaker's identity. The techniques used for performing speaker verification are often vulnerable to malicious attacks that attempt to induce the ASV system to return wrong results, allowing an impostor to bypass the system and gain access. Attackers use a multitude of spoofing techniques for this, such as voice conversion, audio replay, speech synthesis, etc. In recent years, easily available tools to generate deepfaked audio have increased the potential threat to ASV systems. In this paper, we compare the potential of human impersonation (voice disguise) based attacks with attacks based on machine-generated speech, on black-box and white-box ASV systems. We also study countermeasures by using features that capture the unique aspects of human speech production, under the hypothesis that machines cannot emulate many of the fine-level intricacies of the human speech production mechanism. We show that fundamental frequency sequence-related entropy, spectral envelope, and aperiodic parameters are promising candidates for robust detection of deepfaked speech generated by unknown methods.