
Overview of the (a) DreamVC, (b) DreamVG, and (c) Plugin Strategy.
KV represents Cross-Attention (Vaswani et al., 2017) and FiLM represents Feature-wise Linear Modulation layers (Perez et al., 2018), used for fusing the text prompt and the diffusion step t, respectively. Here t is the diffusion step, c is the content embedding of the source speaker, s is the speaker embedding of the target voice, and m is the mel-spectrogram. m_t and s_t denote the noisy versions of the mel-spectrogram and the speaker embedding at diffusion step t.

[WIP] Models and Checkpoints: DreamVC and DreamVG

[Available] Details About the Dataset: DreamVoiceDB: Voice Timbre Dataset

| Source | Prompt | DreamVC | DreamVG+ReDiffVC | DreamVG+FreeVC |
|---|---|---|---|---|
| | A smooth young voice with a gender-neutral tone, that sounds cute. | | | |
| | Authoritative sounding person, who is gender-ambiguous and adult. | | | |
| | Senior's voice who can sound like a male or female with a smooth voice, perfect for storytelling. | | | |
| | A female adult voice with a warm and bright voice, perfect for client and public interaction. | | | |
| | A dark, smooth, and authoritative adult female voice, who sounds attractive and ideal for storytelling. | | | |
| | A teenage girl's voice, characterized by brightness, smoothness, and nasal quality. | | | |
| | Rough sounding attractive teenage girl with a voice suited for client and public interaction. | | | |
| | A teenage girl's voice that is smooth, warm, and attractive, perfect for captivating storytelling. | | | |
| | A senior female voice, dark, authoritative, and strong, ideal for diplomacy and judiciary roles. | | | |
| | A senior woman's voice carries with warmth, depth, and an authoritative tone. | | | |
| | Adult male voice, dark and smooth, authoritative and attractive. | | | |
| | A mature male voice, bright and engaging, good for client and public interaction. | | | |
| | Young boy with a bright, weak, and nasal voice. | | | |
| | Teenager's voice that is rough and weak. | | | |
| | A senior male voice, with a rough texture. | | | |
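
The overview caption notes that FiLM layers fuse the diffusion step t into the network. As a rough illustration of the general technique (Perez et al., 2018), and not of the actual DreamVC or DreamVG code, the sketch below predicts a per-channel scale and shift from a step embedding and applies them to a feature map. The sinusoidal embedding, the function names, and the plain-list tensor layout are all assumptions made for this example.

```python
import math


def timestep_embedding(t, dim):
    """Sinusoidal embedding of diffusion step t (assumed; the page does not specify one)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def linear(x, weight, bias):
    """Dense layer; `weight` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: out[c][n] = gamma[c] * features[c][n] + beta[c].

    features: list of channels, each a list of frame values (e.g. mel frames).
    cond:     conditioning vector, here the embedding of diffusion step t.
    """
    gamma = linear(cond, w_gamma, b_gamma)  # one scale per channel
    beta = linear(cond, w_beta, b_beta)     # one shift per channel
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Toy usage: 2 channels, 2 frames, 4-dimensional step embedding.
features = [[1.0, 2.0], [3.0, 4.0]]
cond = timestep_embedding(t=10, dim=4)
zeros = [[0.0] * 4 for _ in range(2)]

# With zero weights the biases alone set gamma and beta,
# so gamma = 1 and beta = 0 leaves the features unchanged.
identity = film(features, cond, zeros, [1.0, 1.0], zeros, [0.0, 0.0])
scaled = film(features, cond, zeros, [2.0, 2.0], zeros, [1.0, 1.0])
```

Initialising the scale bias to 1 and the shift bias to 0 makes the layer start out as an identity mapping, which is a common way to add conditioning without disturbing the features early in training.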