Recent studies have demonstrated that analysis of laboratory-quality voice recordings can accurately differentiate people diagnosed with Parkinson's disease (PD) from healthy controls (HC). These findings could help facilitate the development of remote screening and monitoring tools for PD. In this study, we analyzed 2759 telephone-quality voice recordings from 1483 PD participants and 15321 recordings from 8300 HC participants. To account for variations in phonetic backgrounds, we acquired data from seven countries. We developed a statistical framework for analyzing voice, whereby we computed 307 dysphonia measures that quantify different properties of voice impairment, such as breathiness, roughness, monopitch, hoarse voice quality, and exaggerated vocal tremor. We used feature selection algorithms to identify robust, parsimonious feature subsets, which were used in combination with a Random Forests (RF) classifier to distinguish PD from HC. The best 10-fold cross-validation performance was obtained using Gram-Schmidt Orthogonalization (GSO) and RF, leading to a mean sensitivity of 64.90% (standard deviation, SD 2.90%) and a mean specificity of 67.96% (SD 2.90%). This large-scale study is a step towards the development of a reliable, cost-effective and practical clinical decision support tool for screening the population at large for PD using telephone-quality voice.
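The pipeline described above (dysphonia features, feature selection, RF classification under 10-fold cross-validation) can be sketched as follows. This is a minimal illustration on synthetic data: the feature matrix, dimensions, and hyperparameters are placeholders, not the study's actual 307 dysphonia measures or its GSO-based selection.

```python
# Minimal sketch of the classification setup: Random Forests evaluated
# with 10-fold cross-validation. Synthetic data stands in for the
# dysphonia feature matrix; nothing here reproduces the paper's results.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a (recordings x dysphonia-measures) matrix.
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=10, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold CV as in the study
print(f"mean accuracy: {scores.mean():.3f} (SD {scores.std():.3f})")
```

In the paper, accuracy is further broken down into sensitivity and specificity; with scikit-learn this would amount to passing `scoring="recall"` (and its complement on the negative class) instead of the default accuracy.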
About 90 percent of people with Parkinson's disease (PD) experience decreased functional communication due to voice and speech disorders associated with dysarthria, which can be characterized by monotony of pitch (or fundamental frequency), reduced loudness, irregular rate of speech, imprecise consonants, and changes in voice quality. Speech-language pathologists (SLPs) work with patients with PD to improve speech intelligibility using various intensive in-clinic speech treatments. SLPs also prescribe home exercises to enhance generalization of speech strategies outside of the treatment room. Even though speech therapies are highly effective in improving vocal loudness and speech quality, patients with PD find it difficult to follow the prescribed exercise regimes outside the clinic and to continue exercises once the treatment is completed. SLPs need techniques to monitor the compliance and accuracy of their patients' exercises at home and in ecologically valid communication situations. We have designed EchoWear, a smartwatch-based system, to remotely monitor speech and voice exercises as prescribed by SLPs. We conducted a study of six individuals (three with PD and three healthy controls) to assess the performance of EchoWear compared with high-quality audio equipment in a speech laboratory. Our preliminary analysis shows promising outcomes for using EchoWear in speech therapies for people with PD. Keywords: Dysarthria; knowledge-based speech processing; Parkinson's disease; smartwatch; speech therapy; wearable system.
End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations. Voice conversion, in which a model has to impersonate a speaker in a recording, is one of those situations. In this paper, we propose Blow, a single-scale normalizing flow using hypernetwork conditioning to perform many-to-many voice conversion on raw audio. Blow is trained end-to-end, with non-parallel data, on a frame-by-frame basis using a single speaker identifier. We show that Blow compares favorably to existing flow-based architectures and other competitive baselines, obtaining equal or better performance in both objective and subjective evaluations. We further assess the impact of its main components with an ablation study, and quantify a number of properties such as the necessary amount of training data or the preference for source or target speakers.
Speech quality assessment (SQA) is a challenging perceptual task for machines. Objective SQA methods that rely on the availability of a corresponding clean reference have been the primary go-to approaches. Clearly, these methods fail in real-world scenarios where ground-truth clean references are not available. In recent years, non-intrusive methods that train neural networks to predict ratings or scores have attracted much attention, but they suffer from several shortcomings, such as lack of robustness and reliance on labeled data for training. In this work, we propose a new direction for speech quality assessment. Inspired by humans' innate ability to compare and assess the quality of speech signals even when they have non-matching contents, we propose a novel framework that predicts a subjective relative quality score for a given speech signal with respect to any provided reference, without using any subjective data. We show that neural networks trained using our framework produce scores that correlate well with subjective mean opinion scores (MOS) and are also competitive with methods such as DNSMOS, which explicitly relies on human MOS for training networks. Moreover, our method also provides a natural way to embed quality-related information in neural networks, which we show is helpful for downstream tasks such as speech enhancement.
In recent years, many studies have addressed the diagnosis of Parkinson's disease (PD) from brain magnetic resonance imaging (MRI) using traditional unsupervised machine learning methods and supervised deep learning models. However, unsupervised learning methods are not good at extracting accurate features from MRIs, and it is difficult to collect enough PD data to satisfy the needs of training deep learning models. Moreover, most of the existing studies are based on single-view MRI data, whose data characteristics are insufficient. In this paper, therefore, in order to tackle the drawbacks mentioned above, we propose a novel semi-supervised learning framework called Semi-supervised Multi-view learning Clustering architecture technology (SMC). The model first introduces the sliding window method to grasp different features, and then uses the Linear Discriminant Analysis (LDA) dimensionality reduction algorithm to process the data with different features. Finally, traditional single-view and multi-view clustering methods are employed on multiple feature views to obtain the results. Experiments show that our proposed method is superior to state-of-the-art unsupervised learning models in clustering performance. Our work could thus contribute to more effective identification of PD from previously labeled and subsequently unlabeled medical MRI data in realistic medical environments.
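The reduce-then-cluster step described above can be sketched in a few lines. This is a hedged illustration on synthetic blobs: KMeans stands in for the paper's single-/multi-view clustering methods, and the data, class count, and dimensions are made up.

```python
# Sketch of the reduction-then-clustering step: LDA projects features
# to a low-dimensional space using the available labels, then a standard
# clustering method groups the samples. Illustrative data only.
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, n_features=20, random_state=0)

# LDA reduces to at most (n_classes - 1) dimensions, i.e. 2 here.
X_low = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_low)
print(X_low.shape, len(set(labels)))
```

Note that LDA is supervised, which is why the framework is semi-supervised overall: labels are needed for the projection step but not for the subsequent clustering of new data.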
Estimating the perceived quality of an audio signal is critical for many multimedia and audio processing systems. Providers strive to offer optimal and reliable services in order to increase the user quality of experience (QoE). In this work, we present an investigation of the applicability of neural networks for non-intrusive audio quality assessment. We propose three neural network-based approaches for mean opinion score (MOS) estimation. We compare our results to three instrumental measures: the perceptual evaluation of speech quality (PESQ), the ITU-T Recommendation P.563, and the speech-to-reverberation energy ratio. Our evaluation uses a speech dataset contaminated with convolutive and additive noise, labeled via a crowd-based QoE evaluation, and reports the Pearson correlation with MOS labels and the mean squared error of the estimated MOS. Our proposed approaches outperform the aforementioned instrumental measures, with a fully connected deep neural network using Mel-frequency features providing the best correlation (0.87) and the lowest mean squared error (0.15).
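The two evaluation metrics reported above, Pearson correlation with MOS labels and mean squared error of the estimated MOS, are standard and easy to state explicitly. The sketch below computes both; the score lists are made-up examples, not the paper's data.

```python
# Pearson correlation and mean squared error between predicted and
# subjective MOS values. Example scores are illustrative only.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

true_mos = [1.5, 2.0, 3.2, 4.1, 4.8]  # hypothetical subjective labels
pred_mos = [1.7, 2.1, 3.0, 4.0, 4.5]  # hypothetical model estimates
print(round(pearson(true_mos, pred_mos), 3), round(mse(true_mos, pred_mos), 3))
```

A high Pearson correlation with a low MSE, as in the reported 0.87 / 0.15 result, indicates that the estimator both ranks signals correctly and is well calibrated in absolute terms.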