Overview
The Text-to-Speech API provides three methods for generating speech:
- Streaming Audio Output: Receive PCM16 audio chunks as they’re generated
- Complete Audio File: Receive a full WAV file after generation completes
- WebSocket Streaming: Real-time bidirectional communication for streaming audio
Endpoints
POST /text-to-speech/:model_id (Complete Audio)
Generate speech from text and receive a complete WAV file.
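A minimal sketch of assembling a complete-audio request. The base URL, model ID, and `voice_settings` field name are assumptions for illustration; only the `/text-to-speech/:model_id` path and the stability parameter come from this document.

```python
API_BASE = "https://api.example.com"  # hypothetical base URL

def build_tts_request(model_id: str, text: str, stability: float = 0.5) -> tuple[str, dict]:
    """Assemble the endpoint URL and JSON body for a complete-audio request."""
    url = f"{API_BASE}/text-to-speech/{model_id}"
    payload = {"text": text, "voice_settings": {"stability": stability}}
    return url, payload

url, payload = build_tts_request("my-model", "Hello, world!")
# POST with your HTTP client of choice, e.g. with requests:
# resp = requests.post(url, json=payload, headers={"Authorization": f"Bearer {API_KEY}"})
# open("speech.wav", "wb").write(resp.content)  # response body is a complete WAV file
```

Separating payload construction from the network call keeps the request shape easy to test without hitting the API.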
POST /text-to-speech/:model_id (Streaming)
Generate speech from text with streaming PCM16 audio chunks.
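The streaming endpoint returns raw PCM16 audio, so each chunk must be decoded into signed 16-bit samples before playback. A minimal decoder sketch, assuming little-endian byte order (the endianness is not stated in this document):

```python
import struct

def pcm16_to_samples(chunk: bytes) -> list[int]:
    """Decode a raw little-endian PCM16 chunk into signed 16-bit samples."""
    count = len(chunk) // 2  # two bytes per sample; ignore a trailing odd byte
    return list(struct.unpack(f"<{count}h", chunk[: count * 2]))

# With a streaming HTTP client you would iterate response chunks, e.g.:
# for chunk in resp.iter_content(chunk_size=4096):
#     samples = pcm16_to_samples(chunk)
#     sink.write(samples)  # hand off to your audio sink as chunks arrive
```

Decoding per chunk lets playback begin before generation finishes, which is the point of the streaming endpoint.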
WSS /text-to-speech
Real-time streaming via WebSocket connection.
Example Usage
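A sketch of the WebSocket flow in Python using the `websockets` package. The host and the message field names (`text`, `model_id`, `flush`) are assumptions; only the `wss /text-to-speech` path comes from this document.

```python
import json

WS_URL = "wss://api.example.com/text-to-speech"  # hypothetical host

def make_ws_message(text: str, model_id: str, flush: bool = False) -> str:
    """Serialize one outbound text frame for the TTS WebSocket (field names assumed)."""
    return json.dumps({"text": text, "model_id": model_id, "flush": flush})

# With the `websockets` package the send/receive loop might look like:
# async with websockets.connect(WS_URL) as ws:
#     await ws.send(make_ws_message("Hello!", "my-model", flush=True))
#     async for frame in ws:
#         handle_audio(frame)  # PCM16 bytes arriving as they are generated
```

Because the connection is bidirectional, text can be sent incrementally while audio frames stream back on the same socket.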
Voice Settings
- stability: Controls how consistent the generated voice sounds (0.0 to 1.0). Higher values produce more consistent output.
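Since stability is documented as a 0.0 to 1.0 range, it is worth clamping user-supplied values before building the request. A small sketch (the `voice_settings` field name is an assumption):

```python
def clamp_stability(value: float) -> float:
    """Clamp a stability setting to the documented 0.0-1.0 range."""
    return max(0.0, min(1.0, value))

# Include it in the request body, e.g.:
# payload = {"text": "Hello", "voice_settings": {"stability": clamp_stability(0.7)}}
```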
Best Practices
- Choose the right model: Select a model based on your latency and quality requirements
- Use streaming for long texts: Streaming reduces perceived latency for longer generations
- Handle errors gracefully: Always check for error responses and handle insufficient balance scenarios
- Cache models: Model information doesn’t change frequently; cache model lists to reduce API calls
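The model-caching advice above can be sketched as a small TTL cache. The TTL value and the `fetch` callable are assumptions; in practice `fetch` would wrap whatever endpoint returns the model list.

```python
import time

CACHE_TTL = 300.0  # seconds; assumed refresh interval

_cache: dict[str, tuple[float, list]] = {}

def get_models(fetch) -> list:
    """Return the cached model list, calling `fetch` only after CACHE_TTL expires."""
    now = time.monotonic()
    entry = _cache.get("models")
    if entry and now - entry[0] < CACHE_TTL:
        return entry[1]  # still fresh: serve from cache
    models = fetch()  # e.g. a GET to the model-list endpoint via your HTTP client
    _cache["models"] = (now, models)
    return models
```

Repeated calls within the TTL window return the cached list without touching the network, which is exactly the reduction in API calls the best practice describes.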