No Arabic abstract
Given an arbitrary face image and an arbitrary speech clip, the proposed work attempts to generating the talking face video with accurate lip synchronization while maintaining smooth transition of both lip and facial movement over the entire video clip. Existing works either do not consider temporal dependency on face images across different video frames thus easily yielding noticeable/abrupt facial and lip movement or are only limited to the generation of talking face video for a specific person thus lacking generalization capacity. We propose a novel conditional video generation network where the audio input is treated as a condition for the recurrent adversarial network such that temporal dependency is incorporated to realize smooth transition for the lip and facial movement. In addition, we deploy a multi-task adversarial training scheme in the context of video generation to improve both photo-realism and the accuracy for lip synchronization. Finally, based on the phoneme distribution information extracted from the audio clip, we develop a sample selection method that effectively reduces the size of the training dataset without sacrificing the quality of the generated video. Extensive experiments on both controlled and uncontrolled datasets demonstrate the superiority of the proposed approach in terms of visual quality, lip sync accuracy, and smooth transition of lip and facial movement, as compared to the state-of-the-art.
In this paper, we explore the task of generating photo-realistic face images from lines. Previous methods based on conditional generative adversarial networks (cGANs) have shown their power to generate visually plausible images when a conditional image and an output image share well-aligned structures. However, these models fail to synthesize face images with a whole set of well-defined structures, e.g. eyes, noses, mouths, etc., especially when the conditional line map lacks one or several parts. To address this problem, we propose a conditional self-attention generative adversarial network (CSAGAN). We introduce a conditional self-attention mechanism to cGANs to capture long-range dependencies between different regions in faces. We also build a multi-scale discriminator. The large-scale discriminator enforces the completeness of global structures and the small-scale discriminator encourages fine details, thereby enhancing the realism of generated face images. We evaluate the proposed model on the CelebA-HD dataset by two perceptual user studies and three quantitative metrics. The experiment results demonstrate that our method generates high-quality facial images while preserving facial structures. Our results outperform state-of-the-art methods both quantitatively and qualitatively.
Speech-driven facial animation is useful for a variety of applications such as telepresence, chatbots, etc. The necessary attributes of having a realistic face animation are 1) audio-visual synchronization (2) identity preservation of the target individual (3) plausible mouth movements (4) presence of natural eye blinks. The existing methods mostly address the audio-visual lip synchronization, and few recent works have addressed the synthesis of natural eye blinks for overall video realism. In this paper, we propose a method for identity-preserving realistic facial animation from speech. We first generate person-independent facial landmarks from audio using DeepSpeech features for invariance to different voices, accents, etc. To add realism, we impose eye blinks on facial landmarks using unsupervised learning and retargets the person-independent landmarks to person-specific landmarks to preserve the identity-related facial structure which helps in the generation of plausible mouth shapes of the target identity. Finally, we use LSGAN to generate the facial texture from person-specific facial landmarks, using an attention mechanism that helps to preserve identity-related texture. An extensive comparison of our proposed method with the current state-of-the-art methods demonstrates a significant improvement in terms of lip synchronization accuracy, image reconstruction quality, sharpness, and identity-preservation. A user study also reveals improved realism of our animation results over the state-of-the-art methods. To the best of our knowledge, this is the first work in speech-driven 2D facial animation that simultaneously addresses all the above-mentioned attributes of a realistic speech-driven face animation.
While accurate lip synchronization has been achieved for arbitrary-subject audio-driven talking face generation, the problem of how to efficiently drive the head pose remains. Previous methods rely on pre-estimated structural information such as landmarks and 3D parameters, aiming to generate personalized rhythmic movements. However, the inaccuracy of such estimated information under extreme conditions would lead to degradation problems. In this paper, we propose a clean yet effective framework to generate pose-controllable talking faces. We operate on raw face images, using only a single photo as an identity reference. The key is to modularize audio-visual representations by devising an implicit low-dimension pose code. Substantially, both speech content and head pose information lie in a joint non-identity embedding space. While speech content information can be defined by learning the intrinsic synchronization between audio-visual modalities, we identify that a pose code will be complementarily learned in a modulated convolution-based reconstruction framework. Extensive experiments show that our method generates accurately lip-synced talking faces whose poses are controllable by other videos. Moreover, our model has multiple advanced capabilities including extreme view robustness and talking face frontalization. Code, models, and demo videos are available at https://hangz-nju-cuhk.github.io/projects/PC-AVS.
We are interested in attribute-guided face generation: given a low-res face input image, an attribute vector that can be extracted from a high-res image (attribute image), our new method generates a high-res face image for the low-res input that satisfies the given attributes. To address this problem, we condition the CycleGAN and propose conditional CycleGAN, which is designed to 1) handle unpaired training data because the training low/high-res and high-res attribute images may not necessarily align with each other, and to 2) allow easy control of the appearance of the generated face via the input attributes. We demonstrate impressive results on the attribute-guided conditional CycleGAN, which can synthesize realistic face images with appearance easily controlled by user-supplied attributes (e.g., gender, makeup, hair color, eyeglasses). Using the attribute image as identity to produce the corresponding conditional vector and by incorporating a face verification network, the attribute-guided network becomes the identity-guided conditional CycleGAN which produces impressive and interesting results on identity transfer. We demonstrate three applications on identity-guided conditional CycleGAN: identity-preserving face superresolution, face swapping, and frontal face generation, which consistently show the advantage of our new method.
Generative models using neural network have opened a door to large-scale studies for various application domains, especially for studies that suffer from lack of real samples to obtain statistically robust inference. Typically, these generative models would train on existing data to learn the underlying distribution of the measurements (e.g., images) in latent spaces conditioned on covariates (e.g., image labels), and generate independent samples that are identically distributed in the latent space. Such models may work for cross-sectional studies, however, they are not suitable to generate data for longitudinal studies that focus on progressive behavior in a sequence of data. In practice, this is a quite common case in various neuroimaging studies whose goal is to characterize a trajectory of pathologies of a specific disease even from early stages. This may be too ambitious especially when the sample size is small (e.g., up to a few hundreds). Motivated from the setup above, we seek to develop a conditional generative model for longitudinal data generation by designing an invertable neural network. Inspired by recurrent nature of longitudinal data, we propose a novel neural network that incorporates recurrent subnetwork and context gating to include smooth transition in a sequence of generated data. Our model is validated on a video sequence dataset and a longitudinal AD dataset with various experimental settings for qualitative and quantitative evaluations of the generated samples. The results with the AD dataset captures AD specific group differences with sufficiently generated longitudinal samples that are consistent with existing literature, which implies a great potential to be applicable to other disease studies.