Text to Speech for IVR: The Production Guide Nobody Writes

Most teams building a phone system treat text-to-speech as a solved problem: pick a voice, call the API, wire it into the call flow. Then the IVR ships, callers report muffled prompts, mispronounced account numbers, and inconsistent volume between menu options, and nobody can explain why the same model that sounded great on the demo sounds wrong on a live call.
The gap isn't the model. It's everything a phone network does to audio between generation and a caller's ear — and almost every TTS-for-IVR guide skips that part.
Why IVR Audio Is a Different Problem Than Web or App Audio
Web and app audio plays back at whatever sample rate you generate it at — usually 22kHz to 48kHz, full-band, plenty of headroom for detail. Phone networks don't work that way.
Traditional telephony runs on the G.711 codec at an 8kHz sample rate, a narrowband standard built for the voice band between 300Hz and 3,400Hz. That limit traces back to analog phone lines and still defines most carrier infrastructure today, IP-based or not (HOLDCOM, cloud.ax). Generate a prompt at 44.1kHz, hand it to a SIP trunk or PBX without converting it, and you're relying on the carrier's own down-sampling. That conversion is inconsistent across vendors and frequently strips clarity from consonants and sibilants, exactly where callers need precision most: account numbers, dates, spelled names.
Get the audio format wrong and it doesn't fail loudly. It just sounds bad, callers hang up or mis-key their input, and the ticket that reaches engineering says "IVR sounds robotic" with no diagnostic path back to the actual cause.
Four Things That Break Between Generation and Playback
Sample rate and codec mismatch. Most TTS APIs default to 22kHz-48kHz WAV or MP3. IVR platforms such as Twilio, Amazon Connect, Genesys, and Asterisk-based PBX systems expect 8kHz audio companded with μ-law (u-law) or A-law, the two encodings defined by G.711 (Yeastar). Skip the conversion and you inherit whatever the platform does by default, which varies vendor to vendor.
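As a rough illustration of the encoding half of that conversion, here is a minimal sketch of G.711 μ-law (u-law) companding using only the Python standard library. It assumes the audio has already been resampled to 8kHz mono 16-bit PCM by a properly filtered resampler, which is not shown here. The function names `linear_to_ulaw` and `wav_to_ulaw` are hypothetical, and A-law uses a different table that this sketch does not cover.

```python
import io
import struct
import wave

BIAS = 0x84    # 132, added before the segment search in the common mu-law method
CLIP = 32635   # largest magnitude that still fits in 15 bits after the bias


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as one G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Segment (exponent) = position of the highest set bit, from bit 14 down to bit 7.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def wav_to_ulaw(wav_bytes: bytes) -> bytes:
    """Convert an in-memory 8 kHz mono 16-bit WAV file to raw mu-law bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        fmt = (wav.getframerate(), wav.getnchannels(), wav.getsampwidth())
        if fmt != (8000, 1, 2):
            # Refuse rather than silently encode audio at the wrong rate.
            raise ValueError(f"expected 8 kHz mono 16-bit PCM, got {fmt}; resample first")
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    return bytes(linear_to_ulaw(s) for s in samples)
```

The format check is the point of the sketch: a file at the wrong rate is rejected up front instead of being passed along for the platform to convert however it likes.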
Silence handling. IVR platforms expect precise silence padding at the start and end of prompts to avoid clipped words or awkward gaps between menu options. TTS models don't generate telephony-aware silence by default — that's a post-processing step, not a model setting.
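One way to make silence a validated property of each prompt is to strip whatever the model produced and re-add fixed padding. The sketch below works on a list of 16-bit PCM samples. The function name `repad_silence`, the 100 ms and 200 ms padding values, and the amplitude threshold are illustrative assumptions, not values taken from any platform's documentation.

```python
def repad_silence(samples, sample_rate=8000, lead_ms=100, tail_ms=200, threshold=200):
    """Trim leading/trailing quiet samples, then add fixed-length digital silence.

    `threshold` is an absolute 16-bit amplitude; everything at or below it
    counts as silence. All default values are placeholders to tune per platform.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        raise ValueError("prompt contains no audio above the silence threshold")
    speech = list(samples[loud[0]:loud[-1] + 1])
    lead = [0] * (sample_rate * lead_ms // 1000)
    tail = [0] * (sample_rate * tail_ms // 1000)
    return lead + speech + tail
```

A plain amplitude threshold can shave soft word onsets and trailing fricatives, so a real pipeline would likely trim conservatively or use energy over a short window rather than single samples.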
Loudness normalization. A prompt rendered at -16 LUFS and a prompt rendered at -23 LUFS differ by 7 LU on the same call, and callers notice the jump between "press 1" and the next prompt even when both used the same voice.
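The correction itself is simple arithmetic once you have a loudness measurement. The sketch below assumes the LUFS value comes from a separate ITU-R BS.1770 meter, which is not implemented here, and that the target is whatever single value you standardize on. `gain_to_target` and `apply_gain` are hypothetical helper names.

```python
def gain_to_target(measured_lufs: float, target_lufs: float):
    """Return (gain in dB, linear amplitude factor) needed to hit the target."""
    gain_db = target_lufs - measured_lufs
    return gain_db, 10 ** (gain_db / 20)


def apply_gain(samples, factor: float):
    """Scale 16-bit PCM samples, hard-clipping anything that would overflow."""
    return [max(-32768, min(32767, round(s * factor))) for s in samples]


# Bringing a -16 LUFS prompt down to match a -23 LUFS prompt:
gain_db, factor = gain_to_target(-16.0, -23.0)   # gain_db is -7.0, factor is about 0.447
```

Going the other way, raising the quieter prompt by 7 dB, can push peaks into the hard clip above, so positive gain would normally be paired with a limiter.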
Model version drift. IVR prompt libraries are built once and expected to sound identical for years. TTS providers update models on their own schedule. An unannounced model update to your provider's default endpoint can shift pacing or pronunciation across your entire prompt library overnight, with no changelog and no rollback path unless you locked the model version yourself.
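Locking a version only helps if you also record which version produced each file. A minimal sketch of such a record follows, assuming your provider exposes some version identifier you can pin and read back. The field names, the `PromptRecord` type, and the version strings in the usage lines are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptRecord:
    prompt_id: str
    text: str
    voice: str
    model_version: str   # the pinned identifier, in whatever form the provider uses
    audio_sha256: str    # fingerprint of the exact file that shipped


def make_record(prompt_id, text, voice, model_version, audio: bytes) -> PromptRecord:
    """Build the audit entry for one generated prompt file."""
    digest = hashlib.sha256(audio).hexdigest()
    return PromptRecord(prompt_id, text, voice, model_version, digest)


def stale_prompts(manifest, pinned_version: str):
    """List prompt IDs generated with anything other than the pinned version."""
    return [r.prompt_id for r in manifest if r.model_version != pinned_version]


def to_json(manifest) -> str:
    """Serialize the manifest so it can live next to the audio files."""
    return json.dumps([asdict(r) for r in manifest], indent=2)
```

With a manifest like this, a complaint about one prompt can be traced to a model version, and a deliberate upgrade produces an explicit list of files to regenerate and re-validate.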
What "Production-Ready" IVR TTS Actually Requires
| Requirement | Why It Matters for IVR | What Gets Missed |
|---|---|---|
| 8kHz G.711 encoding | Matches carrier network format | Teams ship native 22kHz+ output unconverted |
| Silence padding rules | Prevents clipped or run-on prompts | Treated as a platform setting, not validated per prompt |
| Loudness normalization | Keeps volume consistent across menu options | Left to default TTS output levels |
| Model version lock | Prevents pacing/pronunciation drift on updates | No changelog tracking, no re-validation trigger |
| Pronunciation QA on variables | Account numbers, names, dates read correctly | Only tested on sample scripts, not edge-case inputs |
| Per-prompt audit trail | Debugging without re-recording everything | No record of which model version generated which file |
None of this is exotic. It's the difference between a TTS API call and a validated telephony-audio pipeline — and most teams find out the difference exists only after the IVR is live and complaints start coming in.
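Pronunciation QA on variables is the row that most benefits from a test suite. One approach, sketched below under the assumption of English prompts, is to normalize account numbers into spoken digits before the text reaches the model, then test that normalization against edge-case inputs. `spell_digits`, the grouping of three, and the cases listed are illustrative; this checks only the text handed to the TTS model, not the resulting audio.

```python
DIGIT_WORDS = ("zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine")


def spell_digits(value: str, group: int = 3) -> str:
    """Turn '0042-1' into 'zero zero four, two one' so it is read digit by digit."""
    digits = [c for c in value if c in "0123456789"]
    if not digits:
        raise ValueError(f"no digits found in {value!r}")
    words = [DIGIT_WORDS[int(d)] for d in digits]
    # Commas between groups give the voice a place to pause.
    groups = [" ".join(words[i:i + group]) for i in range(0, len(words), group)]
    return ", ".join(groups)


# Inputs that sample scripts tend to miss: leading zeros, long zero runs, separators.
EDGE_CASES = {
    "0042-1": "zero zero four, two one",
    "1000000": "one zero zero, zero zero zero, zero",
    "555 0199": "five five five, zero one nine, nine",
}


def failing_cases(cases=EDGE_CASES):
    """Return the raw inputs whose normalized text does not match expectations."""
    return [raw for raw, expected in cases.items() if spell_digits(raw) != expected]
```

Running `failing_cases()` in CI catches regressions in the normalization step; confirming the audio itself still needs a listening pass or a speech-recognition round trip.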
Building This Yourself vs. Using an Orchestration Layer
You can build the conversion, validation, and version-locking pipeline in-house: a down-sampling step, a loudness normalization pass, a pronunciation test suite for variables, a changelog tracker for whichever TTS provider you use. Teams do this today. It's also ongoing maintenance work that has nothing to do with your actual product, and it has to be rebuilt every time you add a language, switch providers, or a provider silently updates a model underneath you.
Onepin sits above the model layer specifically to remove this work. It orchestrates output across 100+ TTS models, validates every generated prompt against format, pronunciation, and consistency requirements before it ships, retries automatically when a prompt fails validation, and keeps a version-locked, audited record of what generated every file in your prompt library. You're not locked into one model's telephony quirks, and you're not rebuilding the validation layer from scratch when a provider changes something on their end.
If your IVR prompts need to sound the same on caller 1 and caller 50,000, that consistency has to be engineered — the model alone won't guarantee it.
Get Started
Ready to stop debugging muffled IVR prompts one support ticket at a time? Talk to Onepin about validated, telephony-ready TTS output across every model you already use.