No Arabic abstract
This paper proposes a new model, called condition-transforming variational autoencoder (CTVAE), to improve the performance of conversation response generation using conditional variational autoencoders (CVAEs). In conventional CVAEs , the prior distribution of latent variable z follows a multivariate Gaussian distribution with mean and variance modulated by the input conditions. Previous work found that this distribution tends to become condition independent in practical application. In our proposed CTVAE model, the latent variable z is sampled by performing a non-lineartransformation on the combination of the input conditions and the samples from a condition-independent prior distribution N (0; I). In our objective evaluations, the CTVAE model outperforms the CVAE model on fluency metrics and surpasses a sequence-to-sequence (Seq2Seq) model on diversity metrics. In subjective preference tests, our proposed CTVAE model performs significantly better than CVAE and Seq2Seq models on generating fluency, informative and topic relevant responses.
This paper presents an emotion-regularized conditional variational autoencoder (Emo-CVAE) model for generating emotional conversation responses. In conventional CVAE-based emotional response generation, emotion labels are simply used as additional conditions in prior, posterior and decoder networks. Considering that emotion styles are naturally entangled with semantic contents in the language space, the Emo-CVAE model utilizes emotion labels to regularize the CVAE latent space by introducing an extra emotion prediction network. In the training stage, the estimated latent variables are required to predict the emotion labels and token sequences of the input responses simultaneously. Experimental results show that our Emo-CVAE model can learn a more informative and structured latent space than a conventional CVAE model and output responses with better content and emotion performance than baseline CVAE and sequence-to-sequence (Seq2Seq) models.
We investigate large-scale latent variable models (LVMs) for neural story generation -- an under-explored application for open-domain long text -- with objectives in two threads: generation effectiveness and controllability. LVMs, especially the variational autoencoder (VAE), have achieved both effective and controllable generation through exploiting flexible distributional latent representations. Recently, Transformers and its variants have achieved remarkable effectiveness without explicit latent representation learning, thus lack satisfying controllability in generation. In this paper, we advocate to revive latent variable modeling, essentially the power of representation learning, in the era of Transformers to enhance controllability without hurting state-of-the-art generation effectiveness. Specifically, we integrate latent representation vectors with a Transformer-based pre-trained architecture to build conditional variational autoencoder (CVAE). Model components such as encoder, decoder and the variational posterior are all built on top of pre-trained language models -- GPT2 specifically in this paper. Experiments demonstrate state-of-the-art conditional generation ability of our model, as well as its excellent representation learning capability and controllability.
We introduce an improved variational autoencoder (VAE) for text modeling with topic information explicitly modeled as a Dirichlet latent variable. By providing the proposed model topic awareness, it is more superior at reconstructing input texts. Furthermore, due to the inherent interactions between the newly introduced Dirichlet variable and the conventional multivariate Gaussian variable, the model is less prone to KL divergence vanishing. We derive the variational lower bound for the new model and conduct experiments on four different data sets. The results show that the proposed model is superior at text reconstruction across the latent space and classifications on learned representations have higher test accuracies.
We present our work on Track 2 in the Dialog System Technology Challenges 7 (DSTC7). The DSTC7-Track 2 aims to evaluate the response generation of fully data-driven conversation models in knowledge-grounded settings, which provides the contextual-relevant factual texts. The Sequenceto-Sequence models have been widely used for end-to-end generative conversation modelling and achieved impressive results. However, they tend to output dull and repeated responses in previous studies. Our work aims to promote the diversity for end-to-end conversation response generation, which follows a two-stage pipeline: 1) Generate multiple responses. At this stage, two different models are proposed, i.e., a variational generative (VariGen) model and a retrieval based (Retrieval) model. 2) Rank and return the most related response by training a topic coherence discrimination (TCD) model for the ranking process. According to the official evaluation results, our proposed Retrieval and VariGen systems ranked first and second respectively on objective diversity metrics, i.e., Entropy, among all participant systems. And the VariGen system ranked second on NIST and METEOR metrics.
Despite the great promise of Transformers in many sequence modeling tasks (e.g., machine translation), their deterministic nature hinders them from generalizing to high entropy tasks such as dialogue response generation. Previous work proposes to capture the variability of dialogue responses with a recurrent neural network (RNN)-based conditional variational autoencoder (CVAE). However, the autoregressive computation of the RNN limits the training efficiency. Therefore, we propose the Variational Transformer (VT), a variational self-attentive feed-forward sequence model. The VT combines the parallelizability and global receptive field of the Transformer with the variational nature of the CVAE by incorporating stochastic latent variables into Transformers. We explore two types of the VT: 1) modeling the discourse-level diversity with a global latent variable; and 2) augmenting the Transformer decoder with a sequence of fine-grained latent variables. Then, the proposed models are evaluated on three conversational datasets with both automatic metric and human evaluation. The experimental results show that our models improve standard Transformers and other baselines in terms of diversity, semantic relevance, and human judgment.