Conditional Variational AutoEncoder (CVAE) effectively increases the diversity and informativeness of responses in open-ended dialogue generation tasks by enriching the context vector with sampled latent variables. However, due to the inherent one-to-many and many-to-one phenomena in human dialogues, the sampled latent variables may not correctly reflect the context's semantics, leading to irrelevant and incoherent generated responses. To resolve this problem, we propose the Self-separated Conditional Variational AutoEncoder (abbreviated as SepaCVAE), which introduces group information to regularize the latent variables, enhancing CVAE by improving the relevance and coherence of responses while maintaining their diversity and informativeness. SepaCVAE actively divides the input data into groups, and then widens the absolute difference between data pairs from distinct groups while narrowing the relative distance between data pairs in the same group. Empirical results from automatic evaluation and detailed analysis demonstrate that SepaCVAE significantly improves response quality on well-established open-domain dialogue datasets.
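As a minimal sketch of the group-regularization idea (our own illustration, not the authors' released code; the function and weight names are assumptions), the separation objective can be written as a margin loss over pairwise latent distances:

```python
import torch
import torch.nn.functional as F

def group_separation_loss(z, group_ids, margin=1.0):
    """Illustrative SepaCVAE-style regularizer: narrow distances between
    latent codes in the same group and widen distances across groups.

    z         : (batch, dim) latent variables sampled by the CVAE
    group_ids : (batch,) integer group assignment of each input
    """
    dist = torch.cdist(z, z)                                  # pairwise Euclidean distances
    same = group_ids.unsqueeze(0) == group_ids.unsqueeze(1)   # same-group mask
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    intra = dist[same & ~eye]                                 # within-group pairs
    inter = dist[~same]                                       # cross-group pairs
    pull = intra.mean() if intra.numel() else z.new_zeros(())
    push = F.relu(margin - inter).mean() if inter.numel() else z.new_zeros(())
    return pull + push

# Hypothetical usage inside the CVAE objective:
# loss = reconstruction_loss + kl_loss + lambda_sep * group_separation_loss(z, group_ids)
```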
Existing open-domain dialogue generation models are usually trained to mimic the gold response in the training set using cross-entropy loss over the vocabulary. However, a good response need not resemble the gold response, since there are multiple possible responses to a given prompt. In this work, we hypothesize that current models are unable to integrate information from the multiple semantically similar valid responses to a prompt, resulting in generic and uninformative generated responses. To address this issue, we propose an alternative to end-to-end classification over the vocabulary: we instead learn the pairing between prompts and responses as a regression task in a latent space. In our novel dialogue generation model, the representations of semantically related sentences lie close to each other in the latent space. Human evaluation shows that learning the task in a continuous space can generate responses that are both relevant and informative.
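A minimal sketch of the latent-space regression idea, assuming a fixed sentence encoder that embeds gold responses into the latent space (all module names below are illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentRegressor(nn.Module):
    """Map a prompt to a point in the response latent space and train with
    a distance loss instead of token-level cross-entropy."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, prompt_embs):           # (batch, seq_len, dim)
        _, h = self.encoder(prompt_embs)      # final hidden state (1, batch, dim)
        return self.proj(h.squeeze(0))        # predicted response latent (batch, dim)

def regression_loss(pred_z, response_z):
    # Cosine distance: semantically similar valid responses embed near one
    # another, so the target is a smooth region rather than one gold token sequence.
    return 1.0 - F.cosine_similarity(pred_z, response_z).mean()
```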
Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as the synchronized audio and motion beats), while the motion-specific code captures diverse motion information independent of the audio. However, splitting the latent code into two parts poses training difficulties for the VAE model. A mapping network facilitating random sampling along with other techniques including relaxed motion loss, bicycle constraint, and diversity loss are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than state-of-the-art methods, quantitatively and qualitatively. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline. Code and more results are at https://jingli513.github.io/audio2gestures.
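A schematic of the shared/motion-specific split, heavily simplified from the paper (linear layers stand in for the real encoders and decoder, and the mapping network, relaxed motion loss, bicycle constraint, and diversity loss are omitted):

```python
import torch
import torch.nn as nn

class SplitLatentVAE(nn.Module):
    """Cross-modal latent code split into an audio-correlated shared part
    and an audio-independent motion-specific part."""
    def __init__(self, audio_dim=64, motion_dim=96, shared_dim=16, specific_dim=16):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, 2 * shared_dim)      # -> (mu, logvar) of shared code
        self.motion_enc = nn.Linear(motion_dim, 2 * specific_dim)  # -> (mu, logvar) of specific code
        self.decoder = nn.Linear(shared_dim + specific_dim, motion_dim)
        self.specific_dim = specific_dim

    @staticmethod
    def reparameterize(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, audio, motion):
        z_shared = self.reparameterize(self.audio_enc(audio))      # tracks audio/motion beats
        z_specific = self.reparameterize(self.motion_enc(motion))  # diverse motion styles
        return self.decoder(torch.cat([z_shared, z_specific], dim=-1))

    @torch.no_grad()
    def sample(self, audio):
        # At inference the motion-specific code is drawn from the prior,
        # so a single audio clip yields many plausible motions.
        z_shared = self.reparameterize(self.audio_enc(audio))
        z_specific = torch.randn(audio.shape[0], self.specific_dim, device=audio.device)
        return self.decoder(torch.cat([z_shared, z_specific], dim=-1))
```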
Many existing conversation models based on the encoder-decoder framework have focused on making the encoder more complicated to enrich the context vectors, so as to increase the diversity and informativeness of generated responses. However, these approaches face two problems. First, the decoder is too simple to effectively utilize previously generated information and tends to produce duplicated and self-contradicting responses. Second, the complex encoder tends to generate diverse but incoherent responses, because the complex context vectors may deviate from the original semantics of the context. In this work, we propose a conversation model named THINK (Teamwork generation Hover around Impressive Noticeable Keywords) that strengthens the decoder to avoid generating duplicated and self-contradicting responses, while simplifying the context vectors to increase the coherence of generated responses. For this model, we propose a Teamwork generation framework and a Semantics Extractor. Compared with other baselines, both automatic and human evaluation show the advantages of our model.
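A loose illustration of decoding "around keywords" (the wiring below is our assumption for exposition, not the released THINK code): a slim context vector built from extracted keyword embeddings replaces a complex encoder state and is re-read at every decoder step.

```python
import torch
import torch.nn as nn

class KeywordConditionedDecoder(nn.Module):
    """Decoder conditioned on a simple keyword-based context vector."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.cell = nn.GRUCell(2 * dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, keyword_ids, prev_tokens):
        # Simplified context vector: mean of keyword embeddings, standing
        # in for the output of a Semantics-Extractor-like module.
        ctx = self.embed(keyword_ids).mean(dim=1)           # (batch, dim)
        h = torch.zeros_like(ctx)
        logits = []
        for t in range(prev_tokens.shape[1]):
            step = torch.cat([self.embed(prev_tokens[:, t]), ctx], dim=-1)
            h = self.cell(step, h)                          # context re-injected each step
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                   # (batch, seq_len, vocab)
```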
Although substantial efforts have been made to learn disentangled representations under the variational autoencoder (VAE) framework, the fundamental properties of the learning dynamics of most VAE models remain unknown and under-investigated. In this work, we first propose a novel learning objective, termed the principle-of-relevant-information variational autoencoder (PRI-VAE), to learn disentangled representations. We then present an information-theoretic perspective for analyzing existing VAE models by inspecting the evolution of some critical information-theoretic quantities across training epochs. Our observations unveil some fundamental properties associated with VAEs. Empirical results also demonstrate the effectiveness of PRI-VAE on four benchmark datasets.
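The epoch-by-epoch monitoring described above can be made concrete with the two standard ELBO terms; a minimal sketch (our illustration, not the PRI-VAE objective itself):

```python
import torch
import torch.nn.functional as F

def elbo_terms(x, recon, mu, logvar):
    """Per-batch estimates of two quantities commonly tracked across
    training epochs: distortion (reconstruction error) and rate (the KL
    to a standard normal prior, an upper bound on how much information
    the code keeps about the input)."""
    distortion = F.mse_loss(recon, x, reduction='mean')
    rate = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return distortion, rate

# Logging (distortion, rate) after every epoch traces the rate-distortion
# trajectory used to compare the learning dynamics of different VAE variants.
```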
Gravitational wave (GW) detection is now commonplace, and as the sensitivity of the global network of GW detectors improves, we will observe $\mathcal{O}(100)$ transient GW events per year. The current methods used to estimate their source parameters employ optimally sensitive but computationally costly Bayesian inference approaches, where typical analyses have taken between 6 hours and 5 days. For binary neutron star and neutron star black hole systems, prompt counterpart electromagnetic (EM) signatures are expected on timescales of 1 second -- 1 minute, and the current fastest method for alerting EM follow-up observers can provide estimates of a limited range of key source parameters in $\mathcal{O}(1)$ minute. Here we show that a conditional variational autoencoder pre-trained on binary black hole signals can return Bayesian posterior probability estimates. The training procedure need only be performed once for a given prior parameter space, and the resulting trained machine can then generate samples describing the posterior distribution $\sim 6$ orders of magnitude faster than existing techniques.
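A sketch of why the amortized CVAE approach is fast at inference time (the network names below are placeholders, not the paper's code; real models may decode distribution parameters rather than samples directly): after the one-off training, posterior sampling reduces to cheap forward passes with no per-event MCMC or nested sampling.

```python
import torch

@torch.no_grad()
def sample_posterior(prior_net, decoder, strain, n_samples=10_000):
    """Amortized CVAE inference sketch.

    prior_net : trained network mapping strain -> (mu, logvar) of the
                conditional latent prior (placeholder)
    decoder   : trained network mapping (z, strain) -> source parameters
                (placeholder)
    strain    : (data_dim,) whitened detector time series
    returns   : (n_samples, param_dim) samples approximating p(theta | strain)
    """
    y = strain.unsqueeze(0).expand(n_samples, -1)
    mu, logvar = prior_net(y).chunk(2, dim=-1)         # conditional prior p(z | y)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    return decoder(torch.cat([z, y], dim=-1))          # one forward pass per sample
```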