Overview of (a) DreamVC, (b) DreamVG, and (c) the Plugin Strategy. KV represents Cross-Attention (Vaswani et al., 2017) and FiLM represents Feature-wise Linear Modulation layers (Perez et al., 2018), used for fusing the Text Prompt and the diffusion step t, respectively. t is the diffusion step, c is the content embedding of the source speaker, s is the speaker embedding of the target voice, and m is the mel-spectrogram. m_t and s_t represent the noisy versions of the mel-spectrogram and the speaker embedding at diffusion step t.
WIP
Models and Checkpoints: DreamVC and DreamVG
Available
Details About the Dataset: DreamVoiceDB: Voice Timbre Dataset

Source | Prompt | DreamVC | DreamVG+ReDiffVC | DreamVG+FreeVC |
---|---|---|---|---|
A smooth young voice with a gender-neutral tone, that sounds cute. | ||||
Authoritative sounding person, who is gender-ambiguous and adult. | ||||
Senior's voice who can sound like a male or female with a smooth voice, perfect for storytelling. | ||||
A female adult voice with a warm and bright voice, perfect for client and public interaction. | ||||
A dark, smooth, and authoritative adult female voice, who sounds attractive and ideal for storytelling. | ||||
A teenage girl's voice, characterized by brightness, smoothness, and nasal quality. | ||||
Rough sounding attractive teenage girl with a voice suited for client and public interaction. | ||||
A teenage girl's voice that is smooth, warm, and attractive, perfect for captivating storytelling. | ||||
A senior female voice, dark, authoritative, and strong, ideal for diplomacy and judiciary roles. | ||||
A senior woman's voice carries with warmth, depth, and an authoritative tone. | ||||
Adult male voice, dark and smooth, authoritative and attractive. | ||||
A mature male voice, bright and engaging, good for client and public interaction. | ||||
Young boy with a bright, weak, and nasal voice. | ||||
Teenager's voice that is rough and weak. | ||||
A senior male voice, with a rough texture. |
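
For readers unfamiliar with the FiLM layers named in the overview caption, the following is a minimal sketch of how a feature-wise linear modulation can inject the diffusion step t into intermediate features: the conditioning vector is projected to a per-channel scale (gamma) and shift (beta), which are then applied to every frame. It is not the DreamVoice implementation. The sinusoidal step embedding, the helper names (`timestep_embedding`, `linear`, `film`), and the toy shapes are all assumptions made for illustration.

```python
import math


def timestep_embedding(t, dim):
    """Sinusoidal embedding of diffusion step t (an assumed choice; dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def linear(vec, weight, bias):
    """Plain affine map; weight is a list of rows, one row per output channel."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift each frame channel-wise by gamma(cond) and beta(cond)."""
    gamma = linear(cond, w_gamma, b_gamma)
    beta = linear(cond, w_beta, b_beta)
    return [[g * x + b for g, x, b in zip(gamma, frame, beta)] for frame in features]


if __name__ == "__main__":
    # Hypothetical toy input: 2 frames with 3 channels, conditioned on step t = 10.
    cond = timestep_embedding(t=10, dim=4)
    frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    zeros = [[0.0] * 4 for _ in range(3)]  # zero weights, so only the biases act
    print(film(frames, cond, zeros, [2.0] * 3, zeros, [0.5] * 3))
```

In a trained network the projection weights would be learned, so gamma and beta would vary with t; the zero weights in the demo only make the scale-and-shift behaviour easy to inspect.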