DiffWave: A Versatile Diffusion Model for Audio Synthesis.

doi:10.48550/arxiv.2009.09761

Zhifeng Kong, +4 more

174

TL;DR: DiffWave is a non-autoregressive diffusion model for audio synthesis that produces high-fidelity audio across various tasks, matching a strong WaveNet vocoder in speech quality and outperforming autoregressive and GAN-based models in unconditional generation.

Citations

Journal Article•10.1109/TASLP.2023.3288409

AudioLM: A Language Modeling Approach to Audio Generation

Zalán Borsos, +10 more

- 07 Sep 2022

- IEEE/ACM transactions on audio, speech, ...

TL;DR: The proposed hybrid tokenization scheme leverages the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis.

366

•Posted Content•10.48550/arxiv.2204.03458

Video Diffusion Models

07 Apr 2022

TL;DR: The authors propose a diffusion model for video generation that is a natural extension of the standard image diffusion architecture. It enables joint training on image and video data, which they find reduces the variance of minibatch gradients and speeds up optimization.

246

Journal Article•10.48550/arxiv.2308.06571

ModelScope Text-to-Video Technical Report

Jiuniu Wang, +5 more

- 12 Aug 2023

- arXiv.org

TL;DR: ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions, and demonstrates superior performance over state-of-the-art methods across three evaluation metrics.

205

Proceedings Article•10.1145/3503161.3547855

ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech

Rongjie Huang, +5 more

- 13 Jul 2022

TL;DR: ProDiff parameterizes the denoising model by directly predicting clean data, which avoids noticeable quality degradation when sampling is accelerated. It samples 24x faster than real time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time.

167

•Posted Content•10.1145/3592458

Listen, denoise, action! Audio-driven motion synthesis with diffusion models

Simon Alexanderson, +3 more

- 17 Nov 2022

- arXiv.org

TL;DR: The authors adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power, and demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression.

152

References

•Proceedings Article•10.1109/CVPR.2017.634

Aggregated Residual Transformations for Deep Neural Networks

Saining Xie, +4 more

- 21 Jul 2017

TL;DR: ResNeXt is a simple, highly modularized network architecture for image classification, constructed by repeating a building block that aggregates a set of transformations with the same topology.

11.2K

•Proceedings Article•10.1109/ICASSP.2018.8461368

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Jonathan Shen, +12 more

- 15 Apr 2018

TL;DR: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. It is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.

3.3K

•Proceedings Article•10.21437/INTERSPEECH.2017-1452

Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang, +13 more

- 20 Aug 2017

TL;DR: Tacotron is an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given (text, audio) pairs, the model can be trained completely from scratch with random initialization.

1.7K

•Proceedings Article•10.1109/ICASSP.2019.8683143

WaveGlow: A Flow-based Generative Network for Speech Synthesis

Ryan Prenger, +2 more

- 12 May 2019

TL;DR: WaveGlow is a flow-based network capable of generating high-quality speech from mel-spectrograms without the need for auto-regression. It is implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data.

996

•Proceedings Article•10.1109/ICASSP40776.2020.9053795

Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram

Ryuichi Yamamoto, +2 more

- 04 May 2020

TL;DR: Parallel WaveGAN is a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network, which can effectively capture the time-frequency distribution of realistic speech waveforms.

726
