ItôTTS and ItôWave: Linear Stochastic Differential Equation Is All You Need For Audio Generation

Added by Ziqiang Shi

Publication date 2021

fields Informatics Engineering Electronic Engineering

Research language: English

Authors Shoule Wu - Ziqiang Shi

Sound Audio and Speech Processing

In this paper, we propose to unify the two aspects of voice synthesis, namely text-to-speech (TTS) and the vocoder, into one framework based on a pair of forward and reverse-time linear stochastic differential equations (SDEs). The solutions of this SDE pair are two stochastic processes. The first turns the distribution of the mel spectrogram (or waveform) that we want to generate into a simple and tractable distribution. The second is the generation procedure, which turns this simple, tractable signal into the target mel spectrogram (or waveform). The model that generates mel spectrograms is called ItôTTS, and the model that generates waveforms is called ItôWave. ItôTTS and ItôWave use the Wiener process as a driver to gradually subtract the excess signal from a noise signal, generating realistic mel spectrograms and audio, respectively, conditioned on the original text or on a mel spectrogram. Experimental results show that the mean opinion scores (MOS) of ItôTTS and ItôWave can exceed those of current state-of-the-art methods, reaching 3.925 ± 0.160 and 4.35 ± 0.115, respectively. The generated audio samples are available at https://shiziqiang.github.io/ito_audio/. All authors contribute equally to this work.
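The forward/reverse pairing described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a one-dimensional Gaussian "data" distribution and one particular linear SDE, dx = -0.5*BETA*x dt + sqrt(BETA) dW with a constant BETA, neither of which is stated in the abstract. Because the toy data is Gaussian, the score (the gradient of the log-density) has a closed form, which stands in here for the neural network that a real model would train and condition on text or a mel spectrogram. All names (BETA, T, STEPS, marginal, score, forward, reverse) are hypothetical.

```python
import math
import random

BETA = 1.0    # constant noise rate (assumed, not from the paper)
T = 5.0       # time horizon of the forward process (assumed)
STEPS = 200   # number of Euler-Maruyama steps (assumed)


def marginal(t, mu0, var0):
    """Mean and variance of x_t when x_0 ~ N(mu0, var0) under
    dx = -0.5*BETA*x dt + sqrt(BETA) dW."""
    decay = math.exp(-BETA * t)
    return mu0 * math.sqrt(decay), var0 * decay + (1.0 - decay)


def score(x, t, mu0, var0):
    """Closed-form score of the Gaussian marginal at time t.
    A real model would replace this with a learned network."""
    mean, var = marginal(t, mu0, var0)
    return -(x - mean) / var


def forward(x0, rng):
    """Forward SDE: drive a data sample towards N(0, 1) noise."""
    dt = T / STEPS
    x = x0
    for _ in range(STEPS):
        x += -0.5 * BETA * x * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
    return x


def reverse(x_end, mu0, var0, rng):
    """Reverse-time SDE: start from noise and step from t = T back to 0."""
    dt = T / STEPS
    x = x_end
    for i in range(STEPS):
        t = T - i * dt
        # Reverse drift is f(x, t) - g(t)^2 * score(x, t).
        drift = -0.5 * BETA * x - BETA * score(x, t, mu0, var0)
        x += -drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    mu0, var0 = 2.0, 0.25  # toy "data" distribution (assumed)
    samples = [reverse(rng.gauss(0.0, 1.0), mu0, var0, rng) for _ in range(500)]
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print(f"generated mean={mean:.3f}, variance={var:.3f}")
```

With the exact score, samples produced by `reverse` should have roughly the mean and variance of the toy data distribution, up to discretization and sampling error. In the setting of the paper, the toy scalar would be replaced by a mel spectrogram or waveform and the closed-form score by a trained, conditioned network.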

All You Need Is Boundary: Toward Arbitrary-Shaped Text Spotting

Hao Wang, Pu Lu, Hui Zhang (2019)

Recently, end-to-end text spotting, which aims to detect and recognize text in cluttered images simultaneously, has attracted growing interest in computer vision. Unlike existing approaches that formulate text detection as bounding box extraction or instance segmentation, we localize a set of points on the boundary of each text instance. Using this boundary-point representation, we establish a simple yet effective scheme for end-to-end text spotting that can read text of arbitrary shapes. Experiments on three challenging datasets, ICDAR2015, TotalText and COCO-Text, demonstrate that the proposed method consistently surpasses the state of the art in both scene text detection and end-to-end text recognition tasks.

Computer Vision and Pattern Recognition

All You Need is a Second Look: Towards Arbitrary-Shaped Text Detection

Meng Cao, Can Zhang, Dongming Yang (2021)

Arbitrary-shaped text detection is a challenging task because curved texts in the wild have complex geometric layouts. Existing mainstream methods follow the instance segmentation pipeline to obtain the text regions. However, arbitrary-shaped texts are difficult to depict with a single segmentation network because of their varying scales. In this paper, we propose a two-stage segmentation-based detector, termed NASK (Need A Second looK), for arbitrary-shaped text detection. Compared to the traditional single-stage segmentation network, NASK conducts detection in a coarse-to-fine manner, with the first-stage segmentation spotting rectangular text proposals and the second stage retrieving compact representations. Specifically, NASK is composed of a Text Instance Segmentation (TIS) network (1st stage), a Geometry-aware Text RoI Alignment (GeoAlign) module, and a Fiducial pOint eXpression (FOX) module (2nd stage). First, TIS extracts augmented features with a novel Group Spatial and Channel Attention (GSCA) module and conducts instance segmentation to obtain rectangular proposals. Then, GeoAlign converts these rectangles to a fixed size and encodes RoI-wise feature representations. Finally, FOX decomposes each text instance into several pivotal geometrical attributes to refine the detection results. Extensive experimental results on three public benchmarks, Total-Text, SCUT-CTW1500, and ICDAR 2015, verify that NASK outperforms recent state-of-the-art methods.

Computer Vision and Pattern Recognition

Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events

Xuenan Xu, Heinrich Dinkel, Mengyue Wu (2021)

Automated Audio Captioning is a cross-modal task that generates natural language descriptions summarizing the sound events in an audio clip. However, grounding the actual sound events in a given audio clip based on its corresponding caption has not been investigated. This paper contributes an AudioGrounding dataset, which provides the correspondence between sound events and the captions provided in AudioCaps, along with the location (timestamps) of each sound event present. Based on this dataset, we propose the text-to-audio grounding (TAG) task, which jointly considers the relationship between audio processing and language understanding. A baseline approach is provided, resulting in an event-F1 score of 28.3% and a Polyphonic Sound Detection Score (PSDS) of 14.7%.

Sound Audio and Speech Processing

All you need is a second look: Towards Tighter Arbitrary shape text detection

Meng Cao, Yuexian Zou (2020)

Deep learning-based scene text detection methods have progressed substantially over the past years. However, several problems remain to be solved. Generally, long curved text instances tend to be fragmented because of the limited receptive field size of CNNs. Besides, simple representations using rectangular or quadrangular bounding boxes fall short when dealing with more challenging arbitrary-shaped texts. In addition, the scale of text instances varies greatly, which makes accurate prediction through a single segmentation network difficult. To address these problems, we propose a two-stage segmentation-based arbitrary text detector named NASK (Need A Second looK). Specifically, NASK consists of a Text Instance Segmentation network, namely TIS (1st stage), a Text RoI Pooling module, and a Fiducial pOint eXpression module termed FOX (2nd stage). First, TIS conducts instance segmentation to obtain rectangular text proposals, using a proposed Group Spatial and Channel Attention module (GSCA) to augment the feature expression. Then, Text RoI Pooling transforms these rectangles to a fixed size. Finally, FOX is introduced to reconstruct text instances with a tighter representation using the predicted geometrical attributes, including text center line, text line orientation, character scale and character orientation. Experimental results on two public benchmarks, Total-Text and SCUT-CTW1500, demonstrate that the proposed NASK achieves state-of-the-art results.

Computer Vision and Pattern Recognition

Segmentation is All You Need

Zehua Cheng, Yuxiang Wu, Zhenghua Xu (2019)

Region proposal mechanisms are essential for existing deep learning approaches to object detection in images. Although they can generally achieve a good detection performance under normal circumstances, their recall in a scene with extreme cases is unacceptably low. This is mainly because bounding box annotations contain much environment noise information, and non-maximum suppression (NMS) is required to select target boxes. Therefore, in this paper, we propose the first anchor-free and NMS-free object detection model called weakly supervised multimodal annotation segmentation (WSMA-Seg), which utilizes segmentation models to achieve an accurate and robust object detection without NMS. In WSMA-Seg, multimodal annotations are proposed to achieve an instance-aware segmentation using weakly supervised bounding boxes; we also develop a run-data-based following algorithm to trace contours of objects. In addition, we propose a multi-scale pooling segmentation (MSP-Seg) as the underlying segmentation model of WSMA-Seg to achieve a more accurate segmentation and to enhance the detection accuracy of WSMA-Seg. Experimental results on multiple datasets show that the proposed WSMA-Seg approach outperforms the state-of-the-art detectors.

Computer Vision and Pattern Recognition
