anonymous-submission

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Author: anonymized

Paper: anonymized
Code: GitHub

Abstract

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representations are an appealing alternative: they align more closely with human auditory perception and benefit from well-established fast algorithms for their computation. Nevertheless, direct reconstruction of complex-valued spectrograms has historically been problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state of the art in audio quality, as demonstrated in our evaluations, but also substantially improves computational efficiency, achieving an order-of-magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced.

Figure 1: Comparison of a typical time-domain GAN vocoder (a) with the proposed Vocos architecture (b), which maintains the same temporal resolution across all layers. Time-domain vocoders use transposed convolutions to sequentially upsample the signal to the desired sample rate. In contrast, Vocos reaches the target sample rate with a single, computationally efficient inverse Fourier transform.
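The idea in panel (b) can be illustrated with a small, dependency-free sketch. It is not the Vocos implementation: the function names, the full (two-sided) DFT, the window, and the toy sizes `N_FFT = 16` and `HOP = 4` are all assumptions chosen for readability. The magnitude and phase arrays stand in for what a network would predict at frame rate; here they are taken from the STFT of a test tone, so the only step being shown is how a fixed inverse transform with overlap-add turns frame-rate coefficients into a waveform without any learned upsampling.

```python
import cmath
import math

N_FFT = 16  # toy frame size (assumption, not a Vocos setting)
HOP = 4     # toy hop size (assumption, not a Vocos setting)


def hann(n_fft):
    # Periodic Hann window.
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def stft(signal, n_fft, hop):
    # Naive windowed DFT per frame; used only to produce stand-in coefficients.
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft)
        ])
    return frames


def frames_to_waveform(magnitude, phase, n_fft, hop):
    # Combine magnitude and phase into complex coefficients, apply an inverse
    # DFT to each frame, then overlap-add with window-squared normalization.
    window = hann(n_fft)
    length = hop * (len(magnitude) - 1) + n_fft
    out = [0.0] * length
    norm = [0.0] * length
    for t, (mag_t, phase_t) in enumerate(zip(magnitude, phase)):
        spec = [m * cmath.exp(1j * p) for m, p in zip(mag_t, phase_t)]
        for n in range(n_fft):
            x = sum(spec[k] * cmath.exp(2j * math.pi * k * n / n_fft)
                    for k in range(n_fft)).real / n_fft
            out[t * hop + n] += x * window[n]
            norm[t * hop + n] += window[n] ** 2
    # Samples where the window sum is zero cannot be recovered; emit silence.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# Test tone: 64 samples of a sinusoid.
signal = [math.sin(2 * math.pi * 3 * n / 64) for n in range(64)]

# Stand-ins for network outputs: one magnitude/phase vector per frame.
frames = stft(signal, N_FFT, HOP)
magnitude = [[abs(c) for c in frame] for frame in frames]
phase = [[cmath.phase(c) for c in frame] for frame in frames]

wave = frames_to_waveform(magnitude, phase, N_FFT, HOP)
print(len(frames), "frames ->", len(wave), "samples")
```

In this toy setting every frame contributes `HOP` new samples, so the step from frame rate to sample rate is handled entirely by the inverse transform rather than by a stack of transposed convolutions.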

Resynthesis from neural audio codec (EnCodec)

1.5 kbps

Ground truth	EnCodec	Vocos

3 kbps

Ground truth	EnCodec	Vocos

6 kbps

Ground truth	EnCodec	Vocos

12 kbps

Ground truth	EnCodec	Vocos
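
The four bitrates above can be related to the number of residual codebooks used per frame. The short calculation below is only a sketch: it assumes the 24 kHz EnCodec configuration with 75 frames per second and 1024-entry codebooks, neither of which is stated on this page, and the constant names are illustrative.

```python
import math

FRAME_RATE_HZ = 75     # assumed EnCodec 24 kHz frame rate
CODEBOOK_SIZE = 1024   # assumed entries per codebook

# Each codebook index costs log2(1024) = 10 bits per frame.
bits_per_codebook = math.log2(CODEBOOK_SIZE)
bits_per_second_per_codebook = FRAME_RATE_HZ * bits_per_codebook

codebooks = {}
for kbps in (1.5, 3.0, 6.0, 12.0):
    codebooks[kbps] = int(round(kbps * 1000 / bits_per_second_per_codebook))
    print(f"{kbps} kbps -> {codebooks[kbps]} codebooks per frame")
```

Under these assumptions, each doubling of the bitrate doubles the number of codebooks available to the decoder.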

Resynthesis from mel-spectrograms

LibriTTS `test-clean`

Ground truth	HiFi-GAN	BigVGAN	iSTFTNet	Vocos

LibriTTS `test-other`

Ground truth	HiFi-GAN	BigVGAN	iSTFTNet	Vocos

Audio reconstruction from Bark tokens

Token sequences generated with the Bark text-to-audio model: https://github.com/suno-ai/bark

Text prompt	EnCodec	Vocos
So, you've heard about neural vocoding? [laughs] We've been messing around with this new model called Vocos.
Ok [clears throat] let's compare the audio outputs. Listen carefully to the differences in each sample's quality and artifacts.
My friend’s bakery burned down last night. [sighs] Now his business is toast.
Schweinsteiger ist ein nationales kulturgut. Wir müssen ihn um jeden preis schützen.
Polecam odwiedzenie Starego Miasta w Szczecinie! Architektura jest piękna, a lokalna kuchnia doskonała!
我计划在下周的游泳比赛中和我的朋友托尼比赛。他认为自己可以打败我，但他不知道我一直在浴缸里偷偷练习游泳。我不敢说我会赢，但我很确定我会搞出一片浪花。
Bonjour. Aujourd’hui, nous sommes içi pour manger trop de glace.
हॉटस्टार पर रुद्र सबसे बेहतरीन शो है! कहानी बेहद शानदार है, और अजय देवगन बहुत खूबसूरत लगते हैं।
¿Estos payasos llamaron a su modelo como un ladrido de perro? [laughs] ¿En serio?
추석은 내가 가장 좋아하는 명절이다. 나는 며칠 동안 휴식을 취하고 친구 및 가족과 시간을 보낼 수 있습니다