Songxiang Liu, Shan Yang, Dan Su (2021)
Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis aims at transferring a speaking style to the synthesised speech in a target speaker's voice. Most previous CSST approaches rely on expensive high-quality data carrying the desired speaking style during training and require a reference utterance to obtain speaking style descriptors as conditioning on the generation of a new sentence. This work presents Referee, a robust reference-free CSST approach for expressive TTS, which fully leverages low-quality data to learn speaking styles from text. Referee is built by cascading a text-to-style (T2S) model with a style-to-wave (S2W) model. Phonetic PosteriorGram (PPG), phoneme-level pitch and energy contours are adopted as fine-grained speaking style descriptors, which are predicted from text using the T2S model. A novel pretrain-refinement method is adopted to learn a robust T2S model using only readily accessible low-quality data. The S2W model is trained with high-quality target data and is adopted to effectively aggregate style descriptors and generate high-fidelity speech in the target speaker's voice. Experimental results are presented, showing that Referee outperforms a global-style-token (GST)-based baseline approach in CSST.
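The abstract specifies only the cascade's interfaces, not its internals. The sketch below illustrates that two-stage, reference-free flow; the `T2SModel`/`S2WModel` class names, descriptor dimensions, and placeholder tensors are all assumptions for illustration.

```python
# A minimal sketch of a Referee-style T2S -> S2W cascade (assumed names/shapes).
import numpy as np

class T2SModel:
    """Text-to-style: predicts fine-grained style descriptors from phonemes."""
    def predict(self, phonemes: list) -> dict:
        T = len(phonemes)
        return {
            "ppg": np.random.rand(T, 144),  # phonetic posteriorgram frames (144 assumed)
            "pitch": np.random.rand(T),     # phoneme-level F0 contour
            "energy": np.random.rand(T),    # phoneme-level energy contour
        }

class S2WModel:
    """Style-to-wave: aggregates descriptors into target-speaker audio."""
    def synthesize(self, style: dict, speaker_id: int) -> np.ndarray:
        T = style["ppg"].shape[0]
        return np.zeros(T * 256)            # placeholder waveform samples

# Reference-free inference: style comes from text alone, no reference utterance.
t2s, s2w = T2SModel(), S2WModel()
style = t2s.predict(["HH", "AH", "L", "OW"])
wave = s2w.synthesize(style, speaker_id=0)
```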
Kui Liu, Jie Wu, Zhishan Yang (2021)
Denote by $\tau_k(n)$, $\omega(n)$ and $\mu^2(n)$ the number of representations of $n$ as a product of $k$ natural numbers, the number of distinct prime factors of $n$ and the characteristic function of the square-free integers, respectively. Let $[t]$ be the integral part of the real number $t$. For $f = \omega, 2^{\omega}, \mu^2, \tau_k$, we prove that $$\sum_{n\le x} f\Big(\Big[\frac{x}{n}\Big]\Big) = x\sum_{d\ge 1} \frac{f(d)}{d(d+1)} + O_{\varepsilon}\big(x^{\theta_f+\varepsilon}\big)$$ for $x\to\infty$, where $\theta_{\omega} = \frac{53}{110}$, $\theta_{2^{\omega}} = \frac{9}{19}$, $\theta_{\mu^2} = \frac{2}{5}$, $\theta_{\tau_k} = \frac{5k-1}{10k-1}$ and $\varepsilon > 0$ is an arbitrarily small positive number. These improve the corresponding results of Bordellès.
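As a quick plausibility check of the formula's shape (not of the exponents, which would require much larger $x$), one can compare both sides numerically for $f=\omega$; this sketch assumes sympy is available and truncates the convergent series over $d$.

```python
# Numeric sanity check of the asymptotic for f = omega at moderate x.
from sympy import primefactors

def omega(n: int) -> int:
    """Number of distinct prime factors of n (omega(1) = 0)."""
    return len(primefactors(n))

x = 10_000
lhs = sum(omega(x // n) for n in range(1, x + 1))                    # sum of omega([x/n])
main = x * sum(omega(d) / (d * (d + 1)) for d in range(2, 50_000))   # truncated main term
print(lhs, round(main, 1))  # the gap should be far smaller than x itself
```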
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance while at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
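The core restriction, that cross-modal information flows only through a handful of shared latent tokens, can be sketched in a few lines of PyTorch. The dimensions, the single-layer structure, and the averaging of the two bottleneck updates are assumptions here, not the paper's exact design.

```python
# A minimal sketch of bottleneck-mediated fusion (assumed hyperparameters).
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Each modality self-attends jointly with shared bottleneck tokens;
    the updated bottlenecks from both streams are then merged."""
    def __init__(self, dim=256, heads=4, n_bottleneck=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.z = nn.Parameter(torch.randn(1, n_bottleneck, dim))

    def forward(self, audio, video):
        z = self.z.expand(audio.size(0), -1, -1)
        xa = torch.cat([audio, z], dim=1)      # audio tokens + bottlenecks
        xv = torch.cat([video, z], dim=1)      # video tokens + bottlenecks
        ya, _ = self.attn_a(xa, xa, xa)
        yv, _ = self.attn_v(xv, xv, xv)
        na, nv = audio.size(1), video.size(1)
        audio, za = ya[:, :na], ya[:, na:]
        video, zv = yv[:, :nv], yv[:, nv:]
        return audio, video, (za + zv) / 2     # fused bottlenecks carry cross-modal info

fuser = BottleneckFusion()
a, v, z = fuser(torch.randn(2, 100, 256), torch.randn(2, 196, 256))
```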
Bo Hu, Bryan Seybold, Shan Yang (2021)
We present a method to infer the 3D pose of mice, including the limbs and feet, from monocular videos. Many human clinical conditions and their corresponding animal models result in abnormal motion, and accurately measuring 3D motion at scale offers insights into health. The 3D poses improve classification of health-related attributes over 2D representations. The inferred poses are accurate enough to estimate stride length even when the feet are mostly occluded. This method could be applied as part of a continuous monitoring system to non-invasively measure animal health.
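As a toy illustration of one downstream measurement mentioned above (and not the paper's actual pipeline), stride length can be read off an inferred 3D foot trajectory by detecting foot-contact events and measuring ground-plane displacement between them:

```python
# Stride length from a synthetic 3D foot trajectory (illustrative only).
import numpy as np
from scipy.signal import argrelmin

t = np.linspace(0, 2, 200)
# x advances forward, y is lateral, z is foot height bouncing off the ground.
foot = np.stack([0.3 * t, np.zeros_like(t), 0.02 * np.abs(np.sin(8 * t))], axis=1)
contacts = argrelmin(foot[:, 2])[0]                         # foot-touchdown frames
strides = np.linalg.norm(np.diff(foot[contacts, :2], axis=0), axis=1)
print("mean stride length:", strides.mean())
```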
Recently, transductive graph-based methods have achieved great success in the few-shot classification task. However, most existing methods overlook the class-level knowledge that humans can easily learn from just a handful of samples. In this paper, we propose an Explicit Class Knowledge Propagation Network (ECKPN), which is composed of comparison, squeeze and calibration modules, to address this problem. Specifically, we first employ the comparison module to explore the pairwise sample relations to learn rich sample representations in the instance-level graph. Then, we squeeze the instance-level graph to generate the class-level graph, which can help obtain the class-level visual knowledge and facilitate modeling the relations of different classes. Next, the calibration module is adopted to characterize the relations of the classes explicitly to obtain more discriminative class-level knowledge representations. Finally, we combine the class-level knowledge with the instance-level sample representations to guide the inference of the query samples. We conduct extensive experiments on four few-shot classification benchmarks, and the experimental results show that the proposed ECKPN significantly outperforms the state-of-the-art methods.
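A schematic of the comparison → squeeze → calibration flow in PyTorch; the tensor shapes, affinity choices, and the final distance-based classifier are illustrative assumptions, as the actual graph modules in ECKPN are considerably richer.

```python
# Schematic three-stage flow: instance graph -> class graph -> calibrated classes.
import torch
import torch.nn.functional as F

def comparison(x):
    """Instance-level graph: enrich each sample with its pairwise relations."""
    affinity = (-torch.cdist(x, x)).softmax(dim=-1)
    return x + affinity @ x

def squeeze(x, labels, n_classes):
    """Squeeze the instance-level graph into class-level nodes (class means)."""
    onehot = F.one_hot(labels, n_classes).float()
    return (onehot.T @ x) / onehot.sum(0).unsqueeze(1)

def calibrate(c):
    """Refine class nodes by modeling explicit class-to-class relations."""
    rel = (-torch.cdist(c, c)).softmax(dim=-1)
    return c + rel @ c

support = torch.randn(25, 64)                          # 5-way 5-shot support set
labels = torch.arange(5).repeat_interleave(5)
class_nodes = calibrate(squeeze(comparison(support), labels, 5))
query = torch.randn(15, 64)
logits = -torch.cdist(query, class_nodes)              # nearest class wins
```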
We report femtosecond optical pump and x-ray diffraction probe experiments on SnSe. We find that under photoexcitation, SnSe has an instability towards an orthorhombically-distorted rocksalt structure that is not present in the equilibrium phase diagram. The new lattice instability is accompanied by a drastic softening of the lowest-frequency A$_g$ phonon, which is usually associated with the thermodynamic Pnma-Cmcm transition. However, our reconstruction of the transient atomic displacements shows that instead of moving towards the Cmcm structure, the material moves towards a more symmetric orthorhombic distortion of the rock-salt structure belonging to the Immm space group. The experimental results combined with density functional theory (DFT) simulations show that photoexcitation can act as a state-selective perturbation of the electronic distribution, in this case by promoting electrons from Se 4$p$-Sn 5$s$ derived bands from deep below the Fermi level. The potential energy landscape modified by such electronic excitation can then reveal minima with metastable phases that are distinct from those accessible in equilibrium. These results may have implications for optical control of the thermoelectric, ferroelectric and topological properties of the monochalcogenides and related materials.
Few-shot relation extraction (FSRE) is of great importance for the long-tail distribution problem, especially in specialised domains with low-resource data. Most existing FSRE algorithms fail to accurately classify relations based merely on the information in the sentences together with the recognized entity pairs, due to limited samples and lack of knowledge. To address this problem, we propose a novel entity CONCEPT-enhanced FEw-shot Relation Extraction scheme (ConceptFERE), which introduces the inherent concepts of entities to provide clues for relation prediction and boost relation classification performance. Firstly, a concept-sentence attention module is developed to select the most appropriate concept from the multiple concepts of each entity by calculating the semantic similarity between sentences and concepts. Secondly, a self-attention based fusion module is presented to bridge the gap between concept embeddings and sentence embeddings from different semantic spaces. Extensive experiments on the FSRE benchmark dataset FewRel demonstrate the effectiveness and superiority of the proposed ConceptFERE scheme compared to state-of-the-art baselines. Code is available at https://github.com/LittleGuoKe/ConceptFERE.
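The concept-selection step reduces to scoring each candidate concept embedding against the sentence embedding. A minimal sketch, assuming precomputed embeddings and cosine similarity as the semantic-similarity measure (the paper's module is learned on top of pretrained encoders):

```python
# Select the most appropriate concept for an entity via sentence-concept similarity.
import torch

def select_concept(sentence_emb, concept_embs):
    """Return the concept embedding most similar to the sentence, plus all scores."""
    sims = torch.nn.functional.cosine_similarity(
        sentence_emb.unsqueeze(0), concept_embs, dim=-1)
    return concept_embs[sims.argmax()], sims

sentence = torch.randn(768)        # assumed encoder output size
concepts = torch.randn(4, 768)     # e.g. 4 candidate concepts for one entity
best, scores = select_concept(sentence, concepts)
```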
Kui Liu, Jie Wu, Zhishan Yang (2021)
Let $\Lambda(n)$ be the von Mangoldt function, and let $[t]$ be the integral part of the real number $t$. In this note, we prove that for any $\varepsilon>0$ the asymptotic formula $$\sum_{n\le x} \Lambda\Big(\Big[\frac{x}{n}\Big]\Big) = x\sum_{d\ge 1} \frac{\Lambda(d)}{d(d+1)} + O_{\varepsilon}\big(x^{9/19+\varepsilon}\big) \qquad (x\to\infty)$$ holds. This improves a recent result of Bordellès, which requires $\frac{97}{203}$ in place of $\frac{9}{19}$.
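Numerically, the improvement lowers the exponent from $97/203 \approx 0.4778$ to $9/19 \approx 0.4737$. A small sanity check of the main term, assuming sympy and truncating the series (which converges since $\Lambda(d)\le\log d$):

```python
# Compare the left side against the truncated main term at moderate x.
from math import log
from sympy import factorint

def mangoldt(n: int) -> float:
    """Von Mangoldt function: log p if n is a prime power p^k, else 0."""
    f = factorint(n)
    return log(next(iter(f))) if len(f) == 1 else 0.0

x = 10_000
lhs = sum(mangoldt(x // n) for n in range(1, x + 1))
const = sum(mangoldt(d) / (d * (d + 1)) for d in range(2, 50_000))
print(lhs, x * const)  # agreement up to the O(x^{9/19+eps}) error term
```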
Remarkable progress has been made in 3D reconstruction of rigid structures from a video or a collection of images. However, it is still challenging to reconstruct nonrigid structures from RGB inputs, due to the under-constrained nature of the problem. While template-based approaches, such as parametric shape models, have achieved great success in modeling the closed world of known object categories, they cannot handle the open world of novel object categories or outlier shapes well. In this work, we introduce a template-free approach to learn 3D shapes from a single video. It adopts an analysis-by-synthesis strategy that forward-renders object silhouette, optical flow, and pixel values to compare with video observations, which generates gradients to adjust the camera, shape and motion parameters. Without using a category-specific shape template, our method faithfully reconstructs nonrigid 3D structures from videos of humans, animals, and objects of unknown classes. Code will be available at lasr-google.github.io.
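The analysis-by-synthesis loop amounts to render, compare, backpropagate. The sketch below uses a crude stand-in "renderer" (a soft occupancy over projected vertices) purely to show the gradient flow; the actual method relies on a proper differentiable rasterizer and also matches optical flow and pixel values, not just silhouettes.

```python
# Conceptual analysis-by-synthesis loop with a toy differentiable "renderer".
import torch

verts = torch.randn(500, 3, requires_grad=True)    # free-form shape, no template
camera = torch.zeros(2, requires_grad=True)        # stand-in camera (2D offset only)
opt = torch.optim.Adam([verts, camera], lr=1e-2)

def render_silhouette(v, cam, res=64):
    """Soft 2D occupancy from projected vertices; differentiable by construction."""
    xy = v[:, :2] + cam
    axis = torch.linspace(-1, 1, res)
    grid = torch.stack(torch.meshgrid(axis, axis, indexing="ij"), dim=-1)
    d = torch.cdist(grid.reshape(-1, 2), xy).min(dim=1).values
    return torch.sigmoid((0.05 - d) * 40).reshape(res, res)

target = torch.zeros(64, 64)
target[16:48, 16:48] = 1.0                         # observed silhouette mask

for step in range(200):                            # gradients adjust shape + camera
    opt.zero_grad()
    loss = ((render_silhouette(verts, camera) - target) ** 2).mean()
    loss.backward()
    opt.step()
```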
We present AIST++, a new multi-modal dataset of 3D dance motion and music, along with FACT, a Full-Attention Cross-modal Transformer network for generating 3D dance motion conditioned on music. The proposed AIST++ dataset contains 5.2 hours of 3D dance motion in 1408 sequences, covering 10 dance genres, with multi-view videos and known camera poses; to our knowledge, this is the largest dataset of its kind. We show that naively applying sequence models such as transformers to this dataset for the task of music-conditioned 3D motion generation does not produce satisfactory 3D motion that is well correlated with the input music. We overcome these shortcomings by introducing key changes in architecture design and supervision: the FACT model involves a deep cross-modal transformer block with full attention that is trained to predict $N$ future motions. We empirically show that these changes are key factors in generating long sequences of realistic dance motion that are well-attuned to the input music. We conduct extensive experiments on AIST++ with user studies, where our method outperforms recent state-of-the-art methods both qualitatively and quantitatively.
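A skeletal sketch of a full-attention cross-modal block that predicts $N$ future motion frames; the feature dimensions, layer count, and plain concatenation of motion and music tokens are assumptions for illustration, not FACT's exact configuration.

```python
# Cross-modal transformer stub: joint attention over motion + music tokens.
import torch
import torch.nn as nn

class CrossModalMotionPredictor(nn.Module):
    def __init__(self, motion_dim=219, music_dim=35, dim=256, n_future=20):
        super().__init__()
        self.embed_motion = nn.Linear(motion_dim, dim)
        self.embed_music = nn.Linear(music_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.cross = nn.TransformerEncoder(layer, num_layers=4)  # full attention
        self.head = nn.Linear(dim, motion_dim)
        self.n_future = n_future

    def forward(self, motion, music):
        # Concatenate both token streams so attention spans the two modalities.
        x = torch.cat([self.embed_motion(motion), self.embed_music(music)], dim=1)
        h = self.cross(x)
        return self.head(h[:, :self.n_future])   # predict N future motion frames

model = CrossModalMotionPredictor()
future = model(torch.randn(2, 120, 219), torch.randn(2, 240, 35))  # (2, 20, 219)
```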