Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

JSSS: free Japanese speech corpus for summarization and simplification

394 0 0.0 ( 0 )

Download Cite

Added by Shinnosuke Takamichi

Publication date 2020

fields Electronic Engineering Informatics Engineering

and research's language is English

Authors Shinnosuke Takamichi - Mamoru Komachi - Naoko Tanji

Audio and Speech Processing Sound

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

In this paper, we construct a new Japanese speech corpus for speech-based summarization and simplification, JSSS (pronounced j-triple-s). Given the success of reading-style speech synthesis from short-form sentences, we aim to design more difficult tasks for delivering information to humans. Our corpus contains voices recorded for two tasks that have a role in providing information under constraints: duration-constrained text-to-speech summarization and speaking-style simplification. It also contains utterances of long-form sentences as an optional task. This paper describes how we designed the corpus, which is available on our project page.

rate research

JVS corpus: free Japanese multi-speaker voice corpus

92 - Shinnosuke Takamichi , Kentaro Mitsui , Yuki Saito 2019

Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we are developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. In 2017, we released the JSUT corpus, which contains 10 hours of reading-style speech uttered by a single speaker, for end-to-end text-to-speech synthesis. For more general use in speech synthesis research, e.g., voice conversion and multi-speaker modeling, in this paper, we construct the JVS corpus, which contains voice data of 100 speakers in three styles (normal, whisper, and falsetto). The corpus contains 30 hours of voice data including 22 hours of parallel normal voices. This paper describes how we designed the corpus and summarizes the specifications. The corpus is available at our project page.

Sound Audio and Speech Processing

Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition

120 - Yih-Liang Shen , Chao-Yuan Huang , Syu-Siang Wang 2018

Conventional deep neural network (DNN)-based speech enhancement (SE) approaches aim to minimize the mean square error (MSE) between enhanced speech and clean reference. The MSE-optimized model may not directly improve the performance of an automatic speech recognition (ASR) system. If the target is to minimize the recognition error, the recognition results should be used to design the objective function for optimizing the SE model. However, the structure of an ASR system, which consists of multiple units, such as acoustic and language models, is usually complex and not differentiable. In this study, we proposed to adopt the reinforcement learning algorithm to optimize the SE model based on the recognition results. We evaluated the propsoed SE system on the Mandarin Chinese broadcast news corpus (MATBN). Experimental results demonstrate that the proposed method can effectively improve the ASR results with a notable 12.40% and 19.23% error rate reductions for signal to noise ratio at 0 dB and 5 dB conditions, respectively.

Audio and Speech Processing Sound

Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation

106 - Fadi Biadsy , Ron J. Weiss , Pedro J. Moreno 2019

We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model can be trained to normalize speech from any speaker regardless of accent, prosody, and background noise, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody. We further show that this normalization model can be adapted to normalize highly atypical speech from a deaf speaker, resulting in significant improvements in intelligibility and naturalness, measured via a speech recognizer and listening tests. Finally, demonstrating the utility of this model on other speech tasks, we show that the same model architecture can be trained to perform a speech separation task

Audio and Speech Processing Sound

DDS: A new device-degraded speech dataset for speech enhancement

332 - Haoyu Li , Junichi Yamagishi 2021

A large and growing amount of speech content in real-life scenarios is being recorded on common consumer devices in uncontrolled environments, resulting in degraded speech quality. Transforming such low-quality device-degraded speech into high-quality speech is a goal of speech enhancement (SE). This paper introduces a new speech dataset, DDS, to facilitate the research on SE. DDS provides aligned parallel recordings of high-quality speech (recorded in professional studios) and a number

Audio and Speech Processing Sound

USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition Experiments

89 - Muhammadjon Musaev , Saida Mussakhojayeva , Ilyos Khujayorov 2021

We present a freely available speech corpus for the Uzbek language and report preliminary automatic speech recognition (ASR) results using both the deep neural network hidden Markov model (DNN-HMM) and end-to-end (E2E) architectures. The Uzbek speech corpus (USC) comprises 958 different speakers with a total of 105 hours of transcribed audio recordings. To the best of our knowledge, this is the first open-source Uzbek speech corpus dedicated to the ASR task. To ensure high quality, the USC has been manually checked by native speakers. We first describe the design and development procedures of the USC, and then explain the conducted ASR experiments in detail. The experimental results demonstrate promising results for the applicability of the USC for ASR. Specifically, 18.1% and 17.4% word error rates were achieved on the validation and test sets, respectively. To enable experiment reproducibility, we share the USC dataset, pre-trained models, and training recipes in our GitHub repository.

Audio and Speech Processing Computation and Language

comments

Fetching comments

AlHawash Private University

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

JSSS: free Japanese speech corpus for summarization and simplification

Ask ChatGPT about the research

No Arabic abstract

Read More