DiffWave: A Versatile Diffusion Model for Audio Synthesis.
Zhifeng Kong,Wei Ping,Jiaji Huang,Kexin Zhao,Bryan Catanzaro +4 more
174
TL;DR: DiffWave is a non-autoregressive diffusion model for audio synthesis that produces high-fidelity audio across several tasks, matching a strong WaveNet vocoder in speech quality while synthesizing much faster, and outperforming autoregressive and GAN-based models in unconditional generation.
Abstract: In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
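The sampling procedure the abstract describes (white noise refined into a waveform by a fixed-length Markov chain) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. The linear noise schedule, its endpoint values, the step count, and the names make_schedule, reverse_sample, and dummy_eps_model are all assumptions made for the example. The placeholder noise predictor simply returns zeros, where a real system would call a trained network conditioned on the step index (and, for vocoding, on the mel spectrogram).

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Linear noise schedule. The values are assumed, for illustration only."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta  # cumulative product of (1 - beta)
        alpha_bars.append(running)
    return betas, alpha_bars


def reverse_sample(eps_model, length, betas, alpha_bars, rng):
    """Run the reverse Markov chain from white noise for len(betas) steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # start from white noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, t)  # predicted noise at step t
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv = 1.0 / math.sqrt(1.0 - betas[t])
        # Mean of the reverse transition: remove the predicted noise, rescale.
        x = [inv * (xi - scale * ei) for xi, ei in zip(x, eps_hat)]
        if t > 0:
            # Add fresh noise on every step except the last one.
            var = (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
            sigma = math.sqrt(var)
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


def dummy_eps_model(x, t):
    """Hypothetical stand-in for a trained noise-prediction network."""
    return [0.0] * len(x)


betas, alpha_bars = make_schedule()
waveform = reverse_sample(dummy_eps_model, 16, betas, alpha_bars, random.Random(0))
```

The number of loop iterations is fixed by the schedule length and does not depend on the waveform length, which is what makes synthesis non-autoregressive: every sample of the waveform is updated in parallel at each step.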
Citations
AudioLM: A Language Modeling Approach to Audio Generation
Zalán Borsos,Raphaël Marinier,Damien Vincent,Eugene Kharitonov,Olivier Pietquin,Matthew Sharifi,Dominik Roblek,Olivier Teboul,David Grangier,Marco Tagliasacchi,Neil Zeghidour +10 more
TL;DR: The proposed hybrid tokenization scheme leverages the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis.
366
Video Diffusion Models
07 Apr 2022
TL;DR: The authors proposed a diffusion model for video generation, which is a natural extension of the standard image diffusion architecture and enables jointly training from image and video data, which they find to reduce the variance of minibatch gradients and speed up optimization.
246
ModelScope Text-to-Video Technical Report
Jiuniu Wang,Hangjie Yuan,Dayou Liu Jianzhong Chen,Yingya Zhang,Xiang Wang,Shiwei Zhang +5 more
TL;DR: The ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions and demonstrates superior performance over state-of-the-art methods across three evaluation metrics.
205
ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech
Rongjie Huang,Zhou Zhao,Huadai Liu,Jinglin Liu,Chenye Cui,Yi Ren +5 more
- 13 Jul 2022
TL;DR: ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling, and enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time.
Listen, denoise, action! Audio-driven motion synthesis with diffusion models
TL;DR: The authors adapt the DiffWave architecture to model 3D pose sequences, replacing dilated convolutions with Conformers for improved modelling power, and demonstrate control over motion style by using classifier-free guidance to adjust the strength of the stylistic expression.
References
Aggregated Residual Transformations for Deep Neural Networks
Saining Xie,Ross Girshick,Piotr Dollár,Zhuowen Tu,Kaiming He +4 more
- 21 Jul 2017
TL;DR: ResNeXt is a simple, highly modularized network architecture for image classification, constructed by repeating a building block that aggregates a set of transformations with the same topology.
11.2K
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Jonathan Shen,Ruoming Pang,Ron Weiss,Mike Schuster,Navdeep Jaitly,Zongheng Yang,Zhifeng Chen,Yu Zhang,Yuxuan Wang,RJ Skerry-Ryan,Rif A. Saurous,Yannis Agiomyrgiannakis,Yonghui Wu +12 more
- 15 Apr 2018
TL;DR: Tacotron 2 is a neural network architecture for speech synthesis directly from text. It consists of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
3.3K
Tacotron: Towards End-to-End Speech Synthesis
Yuxuan Wang,RJ Skerry-Ryan,Daisy Stanton,Yonghui Wu,Ron Weiss,Navdeep Jaitly,Zongheng Yang,Ying Xiao,Zhifeng Chen,Samy Bengio,Quoc V. Le,Yannis Agiomyrgiannakis,Robert A. J. Clark,Rif A. Saurous +13 more
- 20 Aug 2017
TL;DR: Tacotron is an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization.
1.7K
WaveGlow: A Flow-based Generative Network for Speech Synthesis
Ryan Prenger,Rafael Valle,Bryan Catanzaro +2 more
- 12 May 2019
TL;DR: WaveGlow is a flow-based network capable of generating high-quality speech from mel-spectrograms without the need for auto-regression. It is implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data.
996
Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram
Ryuichi Yamamoto,Eunwoo Song,Jae-Min Kim +2 more
- 04 May 2020
TL;DR: Parallel WaveGAN is a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network, which can effectively capture the time-frequency distribution of realistic speech waveforms.
726