TTS API Pricing in 2026: What You Actually Pay Per Million Characters

TLDR
TTS API pricing ranges from $4/million characters (AWS Polly Standard) to ~$165/million (ElevenLabs Standard/V2 models). The cheapest model is rarely the right one for a given use case. At production scale, the real cost is not just the per-character rate: it also includes retakes, validation failures, format conversion, and provider lock-in.
Picking a TTS provider used to mean one API key and one bill. In 2026, the market has 85+ models across a dozen pricing structures — per-character, per-minute, subscription credits, and token-based billing. The numbers look simple until you run 50,000 clips and discover your actual cost per output bears no resemblance to the advertised rate.
This guide cuts through the noise. Here is what the major TTS APIs actually charge, how to calculate your real costs, and what the price tag misses entirely.
How TTS APIs Charge You
Three billing models dominate the market:
Per-character pricing — You pay per 1,000 characters of input text. This is the clearest model for budgeting, since a given script has a known character count. Deepgram, Google Cloud TTS, and AWS Polly all use this structure.
Subscription credits — You buy a monthly plan that includes a pool of credits. Credits are consumed per character, with premium models costing more credits per character than standard models. ElevenLabs and Cartesia use this approach. Cost per character varies by model and plan tier.
Token-based pricing — OpenAI's newer TTS endpoints charge separately for input tokens and audio output tokens. This makes budgeting less intuitive, since audio token counts differ from character counts.
A fourth model — per-minute pricing — applies to real-time voice agents (Deepgram's Voice Agent API at $0.075/min, Cartesia's Line agents at $0.06/min). These are better compared on conversation-minute economics, not character cost.
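To compare a subscription-credit plan against per-character pricing, it can help to normalize the plan to an effective dollars-per-million-characters figure. The sketch below shows one way to do that. The plan price, credit pool, and credits-per-character ratios are illustrative assumptions, not any provider's published numbers.

```python
def credits_to_per_million(plan_price_usd: float, plan_credits: int,
                           credits_per_char: float) -> float:
    """Effective USD per 1M characters for a subscription-credit plan."""
    if plan_credits <= 0 or credits_per_char <= 0:
        raise ValueError("plan_credits and credits_per_char must be positive")
    chars_covered = plan_credits / credits_per_char
    return plan_price_usd / chars_covered * 1_000_000


# Hypothetical plan: $99/month for a pool of 1,000,000 credits.
standard = credits_to_per_million(99.0, 1_000_000, credits_per_char=1.0)
half_credit = credits_to_per_million(99.0, 1_000_000, credits_per_char=0.5)

print(f"standard model:    ${standard:.2f} per million chars")
print(f"half-credit model: ${half_credit:.2f} per million chars")
```

The same plan yields two different effective rates depending on the model, which is why a single per-character figure for a credit-based provider is always approximate.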
TTS API Pricing Comparison: 2026
| Provider | Model | Price | Pricing Model |
|---|---|---|---|
| AWS Polly | Standard voices | $4/million chars | Per-character |
| xAI | Grok TTS | $4.20/million chars | Per-character |
| Deepgram | Aura-1 | $15/million chars | Per-character |
| AWS Polly | Neural voices | $16/million chars | Per-character |
| OpenAI | TTS-1-HD | ~$30/million chars | Per-character |
| Deepgram | Aura-2 | $30/million chars | Per-character |
| Cartesia | Sonic-3.5 (Pro) | ~$42/million chars | Subscription credits |
| Google Cloud | Neural2 / Studio | $60/million chars | Per-character |
| ElevenLabs | Flash / Turbo models | ~$80/million chars | Subscription credits |
| ElevenLabs | Standard / V2 models | ~$165/million chars | Subscription credits |
All figures reflect pay-as-you-go or base plan rates. Volume discounts apply at enterprise tiers. Character-to-credit conversion rates vary by model within subscription platforms.
Key observations:
- The cheapest options (AWS Polly Standard at $4/million, Grok TTS at $4.20/million) use older or lower-benchmark models. AWS Polly Standard voices score significantly below neural alternatives on naturalness benchmarks.
- Deepgram Aura-2 at $30/million chars offers a strong quality-to-cost ratio for real-time voice agent use cases, with 90ms latency.
- ElevenLabs pricing depends on which model you use. The Flash 2.5 model costs roughly half the credits of the standard V2 model — but targets low-latency agent use cases, not studio-grade narration.
- Google Cloud TTS Studio voices run $60/million but include a 1M character/month free tier, covering small-scale deployments entirely.
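A free tier changes the effective rate at low volume. As a rough sketch, the function below applies a monthly free allowance before billing, using the $60/million rate and 1M-character allowance quoted above. It assumes the allowance is simply subtracted from usage each month, which may not match how a given provider meters it.

```python
def monthly_cost(chars: int, rate_per_million: float, free_chars: int = 0) -> float:
    """USD for one month of usage after subtracting a free allowance."""
    billable = max(0, chars - free_chars)
    return billable / 1_000_000 * rate_per_million


for volume in (500_000, 1_000_000, 5_000_000):
    cost = monthly_cost(volume, rate_per_million=60.0, free_chars=1_000_000)
    print(f"{volume:>9,} chars -> ${cost:,.2f}")
```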
What You Actually Pay at Scale
The per-character rate is your starting point, not your final cost. Here is what each provider costs at 1M, 10M, and 100M characters, assuming zero retakes:
| Provider | Rate | 1M chars | 10M chars | 100M chars |
|---|---|---|---|---|
| Deepgram Aura-1 | $15/M | $15 | $150 | $1,500 |
| OpenAI TTS-1-HD | $30/M | $30 | $300 | $3,000 |
| Deepgram Aura-2 | $30/M | $30 | $300 | $3,000 |
| Cartesia Sonic-3.5 | ~$42/M | $42 | $420 | $4,200 |
| Google Cloud Neural2 | $60/M | $60 | $600 | $6,000 |
| ElevenLabs Flash | ~$80/M | $80 | $800 | $8,000 |
| ElevenLabs Standard | ~$165/M | $165 | $1,650 | $16,500 |
These numbers assume 100% first-pass acceptance. Production teams never see those conditions.
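The table is simple multiplication, so it is easy to re-run with a less optimistic acceptance rate. The sketch below is one way to do that. It assumes every rejected clip is regenerated until it passes and that each attempt passes independently with the same probability, which is a simplification of real retake behavior.

```python
def api_cost(chars: int, rate_per_million: float,
             first_pass_acceptance: float = 1.0) -> float:
    """API spend in USD, inflated by the expected number of attempts per clip."""
    if not 0 < first_pass_acceptance <= 1:
        raise ValueError("first_pass_acceptance must be in (0, 1]")
    expected_attempts = 1 / first_pass_acceptance  # mean of a geometric distribution
    return chars / 1_000_000 * rate_per_million * expected_attempts


# Rates taken from the table above.
RATES_PER_MILLION = {"Deepgram Aura-1": 15.0, "Deepgram Aura-2": 30.0,
                     "ElevenLabs Standard": 165.0}

for name, rate in RATES_PER_MILLION.items():
    ideal = api_cost(100_000_000, rate)
    realistic = api_cost(100_000_000, rate, first_pass_acceptance=0.95)
    print(f"{name}: ${ideal:,.0f} ideal, ${realistic:,.0f} at 95% acceptance")
```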
The Hidden Costs Nobody Quotes
The per-character rate answers one question: how much does it cost to generate a file? It does not answer the question your finance team eventually asks — how much did it cost to ship a file?
Retake economics. A 2–5% retake rate on 50,000 clips means 1,000–2,500 regenerations. Without automated quality scoring, every retake is manual: someone listens, rejects, re-submits. At even $0.50 per clip in labor, that is $500–$1,250 in unbudgeted QA cost on top of the API charges for re-runs. For a production-grade framework covering these dimensions, see Onepin's TTS quality validation checklist.
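The retake arithmetic above can be sketched as a small budget function. The clip count, retake rates, and $0.50 labor figure come from the paragraph; the average clip length (200 characters) and the $30/million API rate are assumptions added only so the API re-run cost can be shown alongside the labor cost.

```python
def retake_budget(clips: int, retake_rate: float, labor_per_retake_usd: float,
                  avg_chars_per_clip: int, rate_per_million: float):
    """Return (retake count, labor cost in USD, API re-run cost in USD)."""
    retakes = round(clips * retake_rate)
    labor = retakes * labor_per_retake_usd
    api = retakes * avg_chars_per_clip / 1_000_000 * rate_per_million
    return retakes, labor, api


for rate in (0.02, 0.05):
    retakes, labor, api = retake_budget(
        clips=50_000, retake_rate=rate, labor_per_retake_usd=0.50,
        avg_chars_per_clip=200,      # assumed clip length
        rate_per_million=30.0,       # assumed API rate
    )
    print(f"{rate:.0%} retakes: {retakes:,} clips, ${labor:,.2f} labor, ${api:,.2f} API")
```

Under these assumptions the labor line is roughly two orders of magnitude larger than the API line, which is the point of the paragraph.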
Mispronunciation failure rates. Brand names, product codes, and proper nouns fail at disproportionate rates. A 2% mispronunciation rate across a 10,000-clip audiobook production means 200 clips require review and re-generation. The API bill for the retakes is minor; the discovery and triage labor is not.
Format conversion. Not every TTS provider outputs the format your delivery target requires. Telephony pipelines typically need 8kHz G.711 audio. Video requires 44.1kHz PCM. If your provider outputs MP3 at 22kHz and you need 44.1kHz WAV, you add a conversion step and an engineering dependency to every output. Output format can also dictate the plan you pay for: ElevenLabs gates its higher-fidelity formats by tier, with 44.1kHz PCM output available only on Pro and above.
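A cheap first step is to detect mismatches before they reach delivery. The sketch below uses Python's standard `wave` module to check whether a clip matches a target sample rate and channel count. It only handles uncompressed PCM WAV, so MP3 or G.711 output would need a different reader, and `make_silent_wav` is a hypothetical helper that exists only to produce test input.

```python
import io
import wave


def needs_conversion(wav_bytes: bytes, target_rate_hz: int,
                     target_channels: int = 1) -> bool:
    """True if the PCM WAV clip does not match the delivery target."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return (wav.getframerate() != target_rate_hz
                or wav.getnchannels() != target_channels)


def make_silent_wav(rate_hz: int, seconds: float = 0.1) -> bytes:
    """Hypothetical helper: build a short mono 16-bit silent clip in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))
    return buf.getvalue()


clip = make_silent_wav(22_050)
print(needs_conversion(clip, target_rate_hz=44_100))  # 22.05kHz clip, 44.1kHz target
print(needs_conversion(clip, target_rate_hz=22_050))  # already matches
```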
Model version drift. TTS models update silently. A voice profile validated against ElevenLabs V2 in Q1 may sound subtly different on V2.5 in Q3. Without version-locking, your validated reference profiles drift. Re-validation at scale is a project, not a task.
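One way to make drift visible is to record the exact model identifier each voice profile was validated against and compare it with what the provider currently serves. The sketch below assumes you can obtain the live model identifier per provider (for example from response metadata); the provider names and version strings are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedProfile:
    voice: str
    provider: str
    validated_model: str  # exact model identifier the profile was approved against


def profiles_needing_revalidation(profiles, live_models):
    """Return profiles whose provider now serves a different model.

    live_models maps provider name to the model identifier currently served.
    A provider missing from live_models is treated as drifted.
    """
    return [p for p in profiles
            if live_models.get(p.provider) != p.validated_model]


profiles = [
    ValidatedProfile("narrator-a", "provider-x", "model-v2"),
    ValidatedProfile("agent-b", "provider-y", "model-v1"),
]
live = {"provider-x": "model-v2.5", "provider-y": "model-v1"}

for profile in profiles_needing_revalidation(profiles, live):
    print(f"re-validate {profile.voice}: approved on {profile.validated_model}")
```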
Provider lock-in migration cost. The moment you switch providers — because a competitor undercuts on price, releases a better model, or changes pricing structure — you rebuild your entire integration. Validated voice profiles, SSML syntax, webhook logic, and audio delivery pipelines are all provider-specific. At 10,000 validated clips, a migration is a months-long project.
Choosing the Right TTS Provider for Your Volume
Volume and use case determine which rate matters:
Under 1M characters/month: Google Cloud's free tier covers you. For anything requiring higher quality, Deepgram Aura-1 or Cartesia's Pro plan offers strong economics at low volume.
1–50M characters/month: Deepgram Aura-2 ($30/M) or OpenAI TTS-1-HD ($30/M) offer the clearest cost visibility. ElevenLabs Flash works for agent applications where the lower credit cost matters more than maximum expressiveness.
Over 50M characters/month: Enterprise pricing from any major provider will undercut published rates substantially. AWS Polly Neural via reserved pricing is a common path for high-volume batch workloads. Google Cloud committed-use discounts apply at scale.
Multilingual production: The economics shift when you need dialect-specific voices. Some providers charge the same rate regardless of language; others charge premiums for specific locale models. For a detailed breakdown of multilingual pipeline construction, see the multilingual TTS pipeline developer guide.
Why Production Teams Route Across Multiple Models
No single model wins across all use cases on price and quality simultaneously.
A podcast network producing 20 shows does not want one voice for every host. A localization team processing 12 languages needs the best-performing model per locale, not the cheapest single provider. A voice agent developer needs sub-100ms latency for real-time conversations but still needs studio-grade output for IVR prompts and on-hold messages. The same team needs two different models and two different pricing structures, managed in one pipeline.
The answer is routing: selecting the right model for each job, validating the output, and shipping only clips that pass quality thresholds. That is what TTS orchestration means in practice.
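As a rough illustration of the routing idea, the sketch below filters a model catalog by latency and quality constraints, tries the cheapest eligible model first, and falls back to the next one when validation fails. This is a generic sketch, not Onepin's implementation. The model names, quality scores, and the `synthesize` and `passes_validation` callables are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_million: float
    latency_ms: int
    quality: float  # hypothetical 0..1 score from your own evaluation


def eligible_models(models: Sequence[Model], max_latency_ms: int,
                    min_quality: float) -> list:
    """Models that satisfy the constraints, cheapest first."""
    ok = [m for m in models
          if m.latency_ms <= max_latency_ms and m.quality >= min_quality]
    return sorted(ok, key=lambda m: m.usd_per_million)


def route(text: str, models: Sequence[Model], max_latency_ms: int,
          min_quality: float,
          synthesize: Callable[[Model, str], bytes],
          passes_validation: Callable[[bytes], bool]) -> Optional[Tuple[str, bytes]]:
    """Try eligible models in cost order; return the first validated output."""
    for model in eligible_models(models, max_latency_ms, min_quality):
        audio = synthesize(model, text)
        if passes_validation(audio):
            return model.name, audio
    return None  # nothing passed; the caller decides how to escalate


CATALOG = [
    Model("budget-batch", 15.0, 400, 0.70),
    Model("fast-agent", 30.0, 90, 0.80),
    Model("studio-narration", 165.0, 600, 0.95),
]


def fake_synthesize(model: Model, text: str) -> bytes:
    """Stand-in for a real provider call."""
    return f"{model.name}:{text}".encode()


# Simulate the cheaper eligible model failing validation.
result = route("Please hold.", CATALOG, max_latency_ms=1000, min_quality=0.75,
               synthesize=fake_synthesize,
               passes_validation=lambda audio: not audio.startswith(b"fast-agent"))
print(result[0] if result else "no model passed")
```

Sorting by cost and falling through on validation failure keeps the cheap path as the default while capping the damage of a bad output at one extra generation.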
Onepin sits above the model layer. It routes jobs to the appropriate TTS provider based on quality requirements, language, latency targets, and cost constraints — then validates each output before it ships. When a model fails, Onepin retries with an alternative. When a model updates silently, Onepin catches the drift before it reaches your audience.
The per-character rate is a starting point. The total cost of production — including retakes, validation, format handling, and migration risk — is what Onepin is built to reduce.
Ready to stop managing TTS providers and start shipping audio? Explore Onepin.