Recent Developments on ESPnet Toolkit Boosted by Conformer

Added by Pengcheng Guo

Publication date 2020

fields Electronic Engineering Informatics Engineering

The research is written in English

Authors Pengcheng Guo - Florian Boyer - Xuankai Chang

Audio and Speech Processing Sound

In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involve a recently proposed architecture called Conformer, a convolution-augmented Transformer. This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS) and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive with or even outperform the current state-of-the-art Transformer models. We are preparing to release all-in-one recipes using open source and publicly available corpora for all the above tasks with pre-trained models. Our aim for this work is to contribute to our research community by reducing the burden of preparing state-of-the-art research environments, which usually require high resources.

ESPnet-ST: All-in-One Speech Translation Toolkit

Hirofumi Inaguma, Shun Kiyono, Kevin Duh 2020

We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework. ESPnet-ST is a new project within the end-to-end speech processing toolkit ESPnet, which integrates or newly implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation. We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines for a wide range of benchmark datasets. Our reproducible results can match or even outperform the current state-of-the-art performance; the pre-trained models are downloadable. The toolkit is publicly available at https://github.com/espnet/espnet.

Computation and Language Sound Audio and Speech Processing

The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

Shinji Watanabe, Florian Boyer, Xuankai Chang 2020

This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017, mainly to deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. ESPnet now also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are trained in an end-to-end manner, thanks to the generic sequence-to-sequence modeling properties, and they can be further integrated and jointly optimized. ESPnet also provides reproducible all-in-one recipes for these applications with state-of-the-art performance on various benchmarks by incorporating Transformer, advanced data augmentation, and Conformer. This project aims to provide an up-to-date speech processing experience to the community so that researchers in academia and industry at various scales can develop their technologies collaboratively.

Audio and Speech Processing Sound

Improved Conformer-based End-to-End Speech Recognition Using Neural Architecture Search

Yukun Liu, Ta Li, Pengyuan Zhang 2021

Recently, neural architecture search (NAS) has been successfully used in image classification, natural language processing, and automatic speech recognition (ASR) tasks to find state-of-the-art (SOTA) architectures that outperform human-designed ones. NAS can derive a SOTA, data-specific architecture over validation data from a pre-defined search space with a search algorithm. Inspired by the success of NAS in ASR tasks, we propose a NAS-based ASR framework containing one search space and one differentiable search algorithm called Differentiable Architecture Search (DARTS). Our search space follows the convolution-augmented transformer (Conformer) backbone, which is a more expressive ASR architecture than those used in existing NAS-based ASR frameworks. To improve the performance of our method, a regulation method called Dynamic Search Schedule (DSS) is employed. On the widely used Mandarin benchmark AISHELL-1, our best-searched architecture significantly outperforms the baseline Conformer model, with a relative CER improvement of about 11%, and search cost comparisons show that our method is efficient.
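A relative improvement is measured against the baseline error rate rather than as an absolute difference in percentage points. The minimal sketch below shows that arithmetic. The function name and the CER values are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def relative_improvement(baseline_cer: float, new_cer: float) -> float:
    """Return the CER reduction as a fraction of the baseline CER."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical values in percent, NOT results reported in the paper.
baseline = 5.0
searched = 4.45
print(f"relative CER improvement: {relative_improvement(baseline, searched):.1%}")
```

With these assumed values, an absolute drop of 0.55 percentage points corresponds to a relative improvement of roughly 11%.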

Audio and Speech Processing Sound

ESPnet-ST IWSLT 2021 Offline Speech Translation System

Hirofumi Inaguma, Brian Yan, Siddharth Dalmia 2021

This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track. This year we made various efforts on training data, architecture, and audio segmentation. On the data side, we investigated sequence-level knowledge distillation (SeqKD) for end-to-end (E2E) speech translation. Specifically, we used multi-referenced SeqKD from multiple teachers trained on different amounts of bitext. On the architecture side, we adopted the Conformer encoder and the Multi-Decoder architecture, which equips dedicated decoders for speech recognition and translation tasks in a unified encoder-decoder model and enables search in both source and target language spaces during inference. We also significantly improved audio segmentation by using the pyannote.audio toolkit and merging multiple short segments for long context modeling. Experimental evaluations showed that each of these techniques contributed to large improvements in translation performance. Our best E2E system combined all the above techniques with model ensembling and achieved 31.4 BLEU on the 2-ref of tst2021 and 21.2 BLEU and 19.3 BLEU on the two single references of tst2021.
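The idea of merging multiple short segments into longer ones can be sketched as a simple greedy pass over time-ordered segments. This is only an illustration under stated assumptions: the function `merge_segments`, its `max_len` and `max_gap` parameters, and the merging rule are hypothetical. It does not use the pyannote.audio API and is not the procedure described in the paper.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def merge_segments(
    segments: List[Segment], max_len: float, max_gap: float
) -> List[Segment]:
    """Greedily merge neighbouring segments (hypothetical heuristic).

    A segment is merged into the previous one when the silence between
    them is at most `max_gap` seconds and the merged span stays within
    `max_len` seconds.
    """
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged:
            prev_start, prev_end = merged[-1]
            if start - prev_end <= max_gap and end - prev_start <= max_len:
                # Extend the previous segment to cover the current one.
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


# Hypothetical segment boundaries in seconds.
print(merge_segments([(0.0, 2.0), (2.5, 4.0), (10.0, 12.0)], max_len=20.0, max_gap=1.0))
```

In this toy input the first two segments are close enough to be joined, while the third stays separate because the gap before it exceeds `max_gap`.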

Audio and Speech Processing Computation and Language Sound

Recent Developments on PIXE Simulation with Geant4

M. G. Pia, G. Weidenspointner 2009

Particle induced X-ray emission (PIXE) is an important physical effect that is not yet adequately modelled in Geant4. This paper provides a critical analysis of the problem domain associated with PIXE simulation and describes a set of software developments to improve PIXE simulation with Geant4. The capabilities of the developed software prototype are illustrated and applied to a study of the passive shielding of the X-ray detectors of the German eROSITA telescope on the upcoming Russian Spectrum-X-Gamma space mission.

Computational Physics
