Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Detecting and analysing spontaneous oral cancer speech in the wild

67 0 0.0 ( 0 )

Download Cite

Added by Bence Halpern

Publication date 2020

fields Electronic Engineering Informatics Engineering

and research's language is English

Authors Bence Mark Halpern - Rob van Son - Michiel van den Brekel

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Oral cancer speech is a disease which impacts more than half a million people worldwide every year. Analysis of oral cancer speech has so far focused on read speech. In this paper, we 1) present and 2) analyse a three-hour long spontaneous oral cancer speech dataset collected from YouTube. 3) We set baselines for an oral cancer speech detection task on this dataset. The analysis of these explainable machine learning baselines shows that sibilants and stop consonants are the most important indicators for spontaneous oral cancer speech detection.

rate research

Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition

127 - Nanxin Chen , Shinji Watanabe , Jesus Villalba 2019

Recently very deep transformers have outperformed conventional bi-directional long short-term memory networks by a large margin in speech recognition. However, to put it into production usage, inference computation cost is still a serious concern in real scenarios. In this paper, we study two different non-autoregressive transformer structure for automatic speech recognition (ASR): A-CMLM and A-FMLM. During training, for both frameworks, input tokens fed to the decoder are randomly replaced by special mask tokens. The network is required to predict the tokens corresponding to those mask tokens by taking both unmasked context and input speech into consideration. During inference, we start from all mask tokens and the network iteratively predicts missing tokens based on partial results. We show that this framework can support different decoding strategies, including traditional left-to-right. A new decoding strategy is proposed as an example, which starts from the easiest predictions to the most difficult ones. Results on Mandarin (Aishell) and Japanese (CSJ) ASR benchmarks show the possibility to train such a non-autoregressive network for ASR. Especially in Aishell, the proposed method outperformed the Kaldi ASR system and it matches the performance of the state-of-the-art autoregressive transformer with 7x speedup. Pretrained models and code will be made available after publication.

Audio and Speech Processing Machine Learning Sound

Comparing Supervised Models And Learned Speech Representations For Classifying Intelligibility Of Disordered Speech On Selected Phrases

133 - Subhashini Venugopalan , Joel Shor , Manoj Plakal 2021

Automatic classification of disordered speech can provide an objective tool for identifying the presence and severity of speech impairment. Classification approaches can also help identify hard-to-recognize speech samples to teach ASR systems about the variable manifestations of impaired speech. Here, we develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases. We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases, which were rated by speech-language pathologists for their overall intelligibility using a five-point Likert scale. We then evaluated classifiers developed using 3 approaches: (1) a convolutional neural network (CNN) trained for the task, (2) classifiers trained on non-semantic speech representations from CNNs that used an unsupervised objective [1], and (3) classifiers trained on the acoustic (encoder) embeddings from an ASR system trained on typical speech [2]. We found that the ASR encoders embeddings considerably outperform the other two on detecting and classifying disordered speech. Further analysis shows that the ASR embeddings cluster speech by the spoken phrase, while the non-semantic embeddings cluster speech by speaker. Also, longer phrases are more indicative of intelligibility deficits than single words.

Audio and Speech Processing Machine Learning Sound

Dissortativity and duplications in Oral cancer

136 - Pramod Shinde , Alok Yadav , Aparna Rai 2016

More than 300,000 new cases worldwide are being diagnosed with oral cancer annually. Complexity of oral cancer renders designing drug targets very difficult. We analyse protein-protein interaction network for the normal and oral cancer tissue and detect crucial changes in the structural properties of the networks in terms of the interactions of the hub proteins and the degree-degree correlations. Further analysis of the spectra of both the networks, while exhibiting universal statistical behavior, manifest distinction in terms of the zero degeneracy, providing insight to the complexity of the underlying system.

Molecular Networks

Attention is All You Need in Speech Separation

145 - Cem Subakan , Mirco Ravanelli , Samuele Cornell 2020

Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short and long-term dependencies with a multi-scale approach that employs transformers. The proposed model achieves state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets. It reaches an SI-SNRi of 22.3 dB on WSJ0-2mix and an SI-SNRi of 19.5 dB on WSJ0-3mix. The SepFormer inherits the parallelization advantages of Transformers and achieves a competitive performance even when downsampling the encoded representation by a factor of 8. It is thus significantly faster and it is less memory-demanding than the latest speech separation systems with comparable performance.

Audio and Speech Processing Machine Learning Sound

Training Keyword Spotters with Limited and Synthesized Speech Data

121 - James Lin , Kevin Kilgour , Dominik Roblek 2020

With the rise of low power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts in the model creation process is obtaining a sufficient amount of training data. In this paper, we explore the effectiveness of synthesized speech data in training small, spoken term detection models of around 400k parameters. Instead of training such models directly on the audio or low level features such as MFCCs, we use a pre-trained speech embedding model trained to extract useful features for keyword spotting models. Using this speech embedding, we show that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples. We also show that a model without our speech embeddings would need to be trained on over 4000 real examples to reach the same accuracy.

Audio and Speech Processing Machine Learning Sound

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Detecting and analysing spontaneous oral cancer speech in the wild

Ask ChatGPT about the research

No Arabic abstract

Read More

suggested questions