No Arabic abstract
Lip sync has emerged as a promising technique for generating mouth movements from audio signals. However, synthesizing a high-resolution and photorealistic virtual news anchor is still challenging. Lack of natural appearance, visual consistency, and processing efficiency are the main problems with existing methods. This paper presents a novel lip-sync framework specially designed for producing high-fidelity virtual news anchors. A pair of Temporal Convolutional Networks are used to learn the cross-modal sequential mapping from audio signals to mouth movements, followed by a neural rendering network that translates the synthetic facial map into a high-resolution and photorealistic appearance. This fully trainable framework provides end-to-end processing that outperforms traditional graphics-based methods in many low-delay applications. Experiments also show the framework has advantages over modern neural-based methods in both visual appearance and efficiency.
Lip-reading aims to recognize speech content from videos via visual analysis of speakers lip movements. This is a challenging task due to the existence of homophemes-words which involve identical or highly similar lip movements, as well as diverse lip appearances and motion patterns among the speakers. To address these challenges, we propose a novel lip-reading model which captures not only the nuance between words but also styles of different speakers, by a multi-grained spatio-temporal modeling of the speaking process. Specifically, we first extract both frame-level fine-grained features and short-term medium-grained features by the visual front-end, which are then combined to obtain discriminative representations for words with similar phonemes. Next, a bidirectional ConvLSTM augmented with temporal attention aggregates spatio-temporal information in the entire input sequence, which is expected to be able to capture the coarse-gained patterns of each word and robust to various conditions in speaker identity, lighting conditions, and so on. By making full use of the information from different levels in a unified framework, the model is not only able to distinguish words with similar pronunciations, but also becomes robust to appearance changes. We evaluate our method on two challenging word-level lip-reading benchmarks and show the effectiveness of the proposed method, which also demonstrate the above claims.
In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use auditory spectrogram as spectral representation of speech and its corresponding sound generation method resulting in a more natural sounding reconstructed speech. Our proposed network consists of an autoencoder to extract bottleneck features from the auditory spectrogram which is then used as target to our main lip reading network comprising of CNN, LSTM and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with a 98% correlation and also improves the quality of reconstructed speech from the main lip reading network. Our model, trained jointly on different speakers is able to extract individual speaker characteristics and gives promising results of reconstructing intelligible speech with superior word recognition accuracy.
We propose a novel framework for creating large-scale photorealistic datasets of indoor scenes, with ground truth geometry, material, lighting and semantics. Our goal is to make the dataset creation process widely accessible, transforming scans into photorealistic datasets with high-quality ground truth for appearance, layout, semantic labels, high quality spatially-varying BRDF and complex lighting, including direct, indirect and visibility components. This enables important applications in inverse rendering, scene understanding and robotics. We show that deep networks trained on the proposed dataset achieve competitive performance for shape, material and lighting estimation on real images, enabling photorealistic augmented reality applications, such as object insertion and material editing. We also show our semantic labels may be used for segmentation and multi-task learning. Finally, we demonstrate that our framework may also be integrated with physics engines, to create virtual robotics environments with unique ground truth such as friction coefficients and correspondence to real scenes. The dataset and all the tools to create such datasets will be made publicly available.
We present a data-driven inference method that can synthesize a photorealistic texture map of a complete 3D face model given a partial 2D view of a person in the wild. After an initial estimation of shape and low-frequency albedo, we compute a high-frequency partial texture map, without the shading component, of the visible face area. To extract the fine appearance details from this incomplete input, we introduce a multi-scale detail analysis technique based on mid-layer feature correlations extracted from a deep convolutional neural network. We demonstrate that fitting a convex combination of feature correlations from a high-resolution face database can yield a semantically plausible facial detail description of the entire face. A complete and photorealistic texture map can then be synthesized by iteratively optimizing for the reconstructed feature correlations. Using these high-resolution textures and a commercial rendering framework, we can produce high-fidelity 3D renderings that are visually comparable to those obtained with state-of-the-art multi-view face capture systems. We demonstrate successful face reconstructions from a wide range of low resolution input images, including those of historical figures. In addition to extensive evaluations, we validate the realism of our results using a crowdsourced user study.
In Virtual, augmented, and mixed reality, the use of hand gestures is increasingly becoming popular to reduce the difference between the virtual and real world. The precise location of the fingertip is essential/crucial for a seamless experience. Much of the research work is based on using depth information for the estimation of the fingertips position. However, most of the work using RGB images for fingertips detection is limited to a single finger. The detection of multiple fingertips from a single RGB image is very challenging due to various factors. In this paper, we propose a deep neural network (DNN) based methodology to estimate the fingertips position. We christened this methodology as an Anchor based Fingertips Position Estimation (ABFPE), and it is a two-step process. The fingertips location is estimated using regression by computing the difference in the location of a fingertip from the nearest anchor point. The proposed framework performs the best with limited dependence on hand detection results. In our experiments on the SCUT-Ego-Gesture dataset, we achieved the fingertips detection error of 2.3552 pixels on a video frame with a resolution of $640 times 480$ and about $92.98%$ of test images have average pixel errors of five pixels.