Voice Selection Guide [TTS]
Overview

The UNITH interface supports multiple text-to-speech (TTS) providers to give your digital human a natural, engaging voice. You can select voices from ElevenLabs or Microsoft Azure directly through the interface, or integrate custom voice providers using our connector framework. For the voice connectors we support, see https://docs.unith.ai/voice-connectors.

Need a different voice provider? You have full flexibility to create custom voice connectors. See the template at https://github.com/unith-ai/voice-connector-template.

How voice selection impacts performance

Digital human responses require audio generation before video synthesis can begin, so audio generation speed directly affects the overall response time of your digital human.

Response pipeline:

1. User query processed
2. Audio generated ← voice model speed matters here
3. Video synthesized from audio
4. Complete response delivered

Faster audio generation means quicker responses and a more natural, engaging user experience.

Recommended voices by provider

ElevenLabs

ElevenLabs offers a wide variety of voices powered by different models, each optimized for specific use cases. For digital human applications, we recommend using voices powered by their speed-optimized models.

Recommended models

| Model | Characteristics | Use case |
| --- | --- | --- |
| Flash v2 | Fastest generation, balanced quality | Real-time conversations |
| Flash v2.5 | Enhanced Flash model | Real-time conversations with improved quality |
| Turbo v2 | High-speed generation | Low-latency interactions |
| Turbo v2.5 | Latest Turbo generation | Optimal balance of speed and quality |

Best practice: select ElevenLabs voices that use the Flash v2, Flash v2.5, Turbo v2, or Turbo v2.5 models for the fastest digital human response times.

Important notes

- All ElevenLabs models will function correctly with digital humans.
- Non-optimized models may result in longer response delays.
- Speed-optimized models are specifically designed for real-time conversational applications.

For a complete list of available voices and their associated models, see https://elevenlabs.io/docs/overview/capabilities/voices.

Microsoft Azure

Microsoft Azure offers an extensive voice catalog across multiple performance tiers. For optimal digital human performance, we recommend selecting voices from their speed-optimized tiers.

Recommended voice types

Select voices that include one of these identifiers in their name:

| Voice type | Identifier | Language support | Performance |
| --- | --- | --- | --- |
| Turbo Multilingual | TurboMultilingual | 40+ languages | Fastest generation across multiple languages |
| HD Flash | HDFlash | English (US), Chinese (Mandarin) | Very fast with high-definition quality |

Voices to avoid

Avoid voices containing HDNeural in their name. These prioritize audio quality over generation speed and will result in longer response times.

Azure voice performance tiers

The table below provides an overview of Microsoft Azure's voice catalog, organized by performance characteristics.

| Performance tier | Language coverage | Available voices | Recommended |
| --- | --- | --- | --- |
| Turbo | English (US) only | 7 | ✅ Yes (fastest option) |
| HD Flash | English (US), Mandarin Chinese | 10 | ✅ Yes (fast with HD quality) |
| Multilingual | 40+ languages | 52 | ✅ Yes (best for multilingual applications) |
| HD Neural | Limited (10-15 languages) | 54 | ⚠️ Not recommended (slower generation) |
| Standard Neural | 150+ locales | 500+ | ⚠️ Mixed performance |

For multilingual digital humans, prioritize voices with TurboMultilingual in their name to maintain fast response times across all supported languages.

For the complete Azure voice catalog and detailed specifications, visit https://learn.microsoft.com/en-us/azure/ai-services/speech-service/index-text-to-speech.

Voice selection best practices

1. Prioritize speed-optimized models: choose voices specifically designed for low-latency applications.
2. Test before deploying: always test selected voices with your digital human to ensure they meet your quality and performance requirements.
3. Consider your audience: balance response speed with voice quality based on your use case.
4. Language requirements: if you need multilingual support, select voices that cover all required languages while maintaining performance.
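
The naming rules in this guide can be applied as a quick pre-check before testing a voice. The Python sketch below is illustrative only and not part of the UNITH interface: the function names are hypothetical, the identifier casing is assumed (matching is case-insensitive), and the ElevenLabs model IDs are assumed to correspond to the Flash v2, Flash v2.5, Turbo v2, and Turbo v2.5 models named above.

```python
# Azure name identifiers taken from this guide. Matching is case-insensitive
# because the exact casing in real voice names is an assumption here.
AZURE_FAST_IDENTIFIERS = ("turbomultilingual", "hdflash")
AZURE_SLOW_IDENTIFIERS = ("hdneural",)

# Assumed ElevenLabs model IDs for Flash v2, Flash v2.5, Turbo v2, Turbo v2.5.
ELEVENLABS_FAST_MODELS = frozenset({
    "eleven_flash_v2",
    "eleven_flash_v2_5",
    "eleven_turbo_v2",
    "eleven_turbo_v2_5",
})


def classify_azure_voice(voice_name: str) -> str:
    """Return 'recommended', 'avoid', or 'untested' based on the voice name."""
    name = voice_name.lower()
    if any(tag in name for tag in AZURE_SLOW_IDENTIFIERS):
        return "avoid"
    if any(tag in name for tag in AZURE_FAST_IDENTIFIERS):
        return "recommended"
    # No identifier from the guide matched: measure latency before using.
    return "untested"


def is_fast_elevenlabs_model(model_id: str) -> bool:
    """Return True if the model ID is one of the speed-optimized models."""
    return model_id.strip().lower() in ELEVENLABS_FAST_MODELS


if __name__ == "__main__":
    # Hypothetical voice names, used only to exercise the rules.
    for voice in ("en-US-ExampleTurboMultilingualNeural", "en-US-ExampleHDNeural"):
        print(voice, "->", classify_azure_voice(voice))
    print("eleven_flash_v2_5 ->", is_fast_elevenlabs_model("eleven_flash_v2_5"))
```

A name check like this only narrows the candidates. Voices that come back as "untested" may still perform well, so they should be tested with your digital human rather than ruled out.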