Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs

70 0 0.0 ( 0 )

Download Cite

Added by Themos Stafylakis

Publication date 2018

fields Informatics Engineering

and research's language is English

Authors Themos Stafylakis - Muhammad Haris Khan - Georgios Tzimiropoulos

Computer Vision and Pattern Recognition

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Visual and audiovisual speech recognition are witnessing a renaissance which is largely due to the advent of deep learning methods. In this paper, we present a deep learning architecture for lipreading and audiovisual word recognition, which combines Residual Networks equipped with spatiotemporal input layers and Bidirectional LSTMs. The lipreading architecture attains 11.92% misclassification rate on the challenging Lipreading-In-The-Wild database, which is composed of excerpts from BBC-TV, each containing one of the 500 target words. Audiovisual experiments are performed using both intermediate and late integration, as well as several types and levels of environmental noise, and notable improvements over the audio-only network are reported, even in the case of clean speech. A further analysis on the utility of target word boundaries is provided, as well as on the capacity of the network in modeling the linguistic context of the target word. Finally, we examine difficult word pairs and discuss how visual information helps towards attaining higher recognition accuracy.

rate research

Pushing the Boundaries of View Extrapolation with Multiplane Images

104 - Pratul P. Srinivasan , Richard Tucker , Jonathan T. Barron 2019

We explore the problem of view synthesis from a narrow baseline pair of images, and focus on generating high-quality view extrapolations with plausible disocclusions. Our method builds upon prior work in predicting a multiplane image (MPI), which represents scene content as a set of RGB$alpha$ planes within a reference view frustum and renders novel views by projecting this content into the target viewpoints. We present a theoretical analysis showing how the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency, as well as a novel MPI prediction procedure that theoretically enables view extrapolations of up to $4times$ the lateral viewpoint movement allowed by prior work. Our method ameliorates two specific issues that limit the range of views renderable by prior methods: 1) We expand the range of novel views that can be rendered without depth discretization artifacts by using a 3D convolutional network architecture along with a randomized-resolution training procedure to allow our model to predict MPIs with increased disparity sampling frequency. 2) We reduce the repeated texture artifacts seen in disocclusions by enforcing a constraint that the appearance of hidden content at any depth must be drawn from visible content at or behind that depth. Please see our results video at: https://www.youtube.com/watch?v=aJqAaMNL2m4.

Computer Vision and Pattern Recognition

Skin Cancer Recognition using Deep Residual Network

66 - Brij Rokad , Dr. Sureshkumar Nagarajan 2019

The advances in technology have enabled people to access internet from every part of the world. But to date, access to healthcare in remote areas is sparse. This proposed solution aims to bridge the gap between specialist doctors and patients. This prototype will be able to detect skin cancer from an image captured by the phone or any other camera. The network is deployed on cloud server-side processing for an even more accurate result. The Deep Residual learning model has been used for predicting the probability of cancer for server side The ResNet has three parametric layers. Each layer has Convolutional Neural Network, Batch Normalization, Maxpool and ReLU. Currently the model achieves an accuracy of 77% on the ISIC - 2017 challenge.

Computer Vision and Pattern Recognition Image and Video Processing

Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition

91 - Di Hu , Xuhong Li , Lichao Mou 2020

Aerial scene recognition is a fundamental task in remote sensing and has recently received increased interest. While the visual information from overhead images with powerful models and efficient algorithms yields considerable performance on scene recognition, it still suffers from the variation of ground objects, lighting conditions etc. Inspired by the multi-channel perception theory in cognition science, in this paper, for improving the performance on the aerial scene recognition, we explore a novel audiovisual aerial scene recognition task using both images and sounds as input. Based on an observation that some specific sound events are more likely to be heard at a given geographic location, we propose to exploit the knowledge from the sound events to improve the performance on the aerial scene recognition. For this purpose, we have constructed a new dataset named AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE). With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for the aerial scene recognition. The source code is publicly available for reproducibility purposes.

Computer Vision and Pattern Recognition Machine Learning Multimedia

Motion-Based Handwriting Recognition and Word Reconstruction

123 - Junshen Kevin Chen , Wanze Xie , Yutong He 2021

In this project, we leverage a trained single-letter classifier to predict the written word from a continuously written word sequence, by designing a word reconstruction pipeline consisting of a dynamic-programming algorithm and an auto-correction model. We conduct experiments to optimize models in this pipeline, then employ domain adaptation to explore using this pipeline on unseen data distributions.

Computer Vision and Pattern Recognition

LSTMs and Deep Residual Networks for Carbohydrate and Bolus Recommendations in Type 1 Diabetes Management

88 - Jeremy Beauchamp , Razvan Bunescu , Cindy Marling 2021

To avoid serious diabetic complications, people with type 1 diabetes must keep their blood glucose levels (BGLs) as close to normal as possible. Insulin dosages and carbohydrate consumption are important considerations in managing BGLs. Since the 1960s, models have been developed to forecast blood glucose levels based on the history of BGLs, insulin dosages, carbohydrate intake, and other physiological and lifestyle factors. Such predictions can be used to alert people of impending unsafe BGLs or to control insulin flow in an artificial pancreas. In past work, we have introduced an LSTM-based approach to blood glucose level prediction aimed at what if scenarios, in which people could enter foods they might eat or insulin amounts they might take and then see the effect on future BGLs. In this work, we invert the what-if scenario and introduce a similar architecture based on chaining two LSTMs that can be trained to make either insulin or carbohydrate recommendations aimed at reaching a desired BG level in the future. Leveraging a recent state-of-the-art model for time series forecasting, we then derive a novel architecture for the same recommendation task, in which the two LSTM chain is used as a repeating block inside a deep residual architecture. Experimental evaluations using real patient data from the OhioT1DM dataset show that the new integrated architecture compares favorably with the previous LSTM-based approach, substantially outperforming the baselines. The promising results suggest that this novel approach could potentially be of practical use to people with type 1 diabetes for self-management of BGLs.

Machine Learning