DreamVoiceConversion

💭 DreamVoice: Text-Guided Voice Generation and Conversion

Model Architecture

Overview of (a) DreamVC, (b) DreamVG, and (c) the Plugin Strategy.

Modules in blue are pre-trained models and remain frozen during training, while modules in yellow are trained.
Green blocks represent the source speaker information while red blocks represent the target speaker information.
Purple blocks correspond to the converted speech.
Dashed lines represent skip connections.
LM represents the Language Model.
KV represents Cross-Attention (Vaswani et al., 2017), used for fusing the Text Prompt, and FiLM represents Feature-wise Linear Modulation layers (Perez et al., 2018), used for fusing the diffusion step t.
SDE solver is the stochastic differential equation solver used for diffusion sampling.
Text Prompt is the text description of the desired target voice.
t is the diffusion step.
c is the content embedding of the source speaker.
s is the speaker embedding of the target voice.
m is the mel-spectrogram.
m_t and s_t represent the noisy versions of the mel-spectrogram and the speaker embedding at the diffusion step t.
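
As a rough illustration of how a FiLM layer can fuse the diffusion step t into a feature map, the sketch below applies the standard FiLM rule from Perez et al. (2018): each channel is scaled by gamma and shifted by beta, where both are predicted from a conditioning vector. It is not the DreamVoice implementation. The sinusoidal step embedding, the function names (`timestep_embedding`, `linear`, `film`), and the tensor layout (channels x frames) are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def timestep_embedding(t: int, dim: int) -> List[float]:
    """Sinusoidal embedding of the diffusion step t.

    Assumption: a common choice in diffusion models, not confirmed for DreamVoice.
    `dim` must be even.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def linear(x: Sequence[float],
           weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Plain affine map: one output value per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(features: Sequence[Sequence[float]],
         cond: Sequence[float],
         w_gamma: Sequence[Sequence[float]], b_gamma: Sequence[float],
         w_beta: Sequence[Sequence[float]], b_beta: Sequence[float]) -> List[List[float]]:
    """Feature-wise Linear Modulation: out[c][n] = gamma[c] * features[c][n] + beta[c].

    `features` is laid out as channels x frames (hypothetical layout);
    gamma and beta are predicted per channel from the conditioning vector.
    """
    gamma = linear(cond, w_gamma, b_gamma)
    beta = linear(cond, w_beta, b_beta)
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


if __name__ == "__main__":
    cond = timestep_embedding(t=10, dim=4)
    zeros = [[0.0] * 4 for _ in range(2)]
    # With zero weights, gamma and beta reduce to their biases (2 and 1 here).
    print(film([[1.0, 2.0], [3.0, 4.0]], cond, zeros, [2.0, 2.0], zeros, [1.0, 1.0]))
```

In a trained model the weights of the two affine maps would be learned, so that gamma and beta vary with t; the zero-weight call above only makes the scale-and-shift behaviour easy to inspect.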

😊 WIP Models and Checkpoints: DreamVC and DreamVG
- Code and checkpoints will be released depending on the acceptance decision.
💻 Available Details About the Dataset: DreamVoiceDB: Voice Timbre Dataset

Source	Prompt	DreamVC	DreamVG+ReDiffVC	DreamVG+FreeVC
	A smooth young voice with a gender-neutral tone, that sounds cute.
	Authoritative sounding person, who is gender-ambiguous and adult.
	Senior's voice who can sound like a male or female with a smooth voice, perfect for storytelling.
	A female adult voice with a warm and bright voice, perfect for client and public interaction.
	A dark, smooth, and authoritative adult female voice, who sounds attractive and ideal for storytelling.
	A teenage girl's voice, characterized by brightness, smoothness, and nasal quality.
	Rough sounding attractive teenage girl with a voice suited for client and public interaction.
	A teenage girl's voice that is smooth, warm, and attractive, perfect for captivating storytelling.
	A senior female voice, dark, authoritative, and strong, ideal for diplomacy and judiciary roles.
	A senior woman's voice carries with warmth, depth, and an authoritative tone.
	Adult male voice, dark and smooth, authoritative and attractive.
	A mature male voice, bright and engaging, good for client and public interaction.
	Young boy with a bright, weak, and nasal voice.
	Teenager's voice that is rough and weak.
	A senior male voice, with a rough texture.