Jun 23, 2026

TTS API Pricing in 2026: What You Actually Pay Per Million Characters

TLDR

TTS API pricing ranges from $4/million characters (AWS Polly Standard) to roughly $165/million (ElevenLabs Standard / V2 models). The cheapest model is rarely the right one for every use case. At production scale, the real cost is not just the per-character rate: it also includes retakes, validation failures, format conversion, and provider lock-in.

Picking a TTS provider used to mean one API key and one bill. In 2026, the market has 85+ models across a dozen pricing structures — per-character, per-minute, subscription credits, and token-based billing. The numbers look simple until you run 50,000 clips and discover your actual cost per output bears no resemblance to the advertised rate.

This guide cuts through the noise. Here is what the major TTS APIs actually charge, how to calculate your real costs, and what the price tag misses entirely.

How TTS APIs Charge You

Three billing models dominate the market:

Per-character pricing: You pay for each character of input text, with rates quoted per 1,000 or per million characters. This is the clearest model for budgeting, since a given script has a known character count. Deepgram, Google Cloud TTS, and AWS Polly all use this structure.

Subscription credits: You buy a monthly plan that includes a pool of credits. Credits are consumed per character, with premium models costing more credits per character than standard models. ElevenLabs and Cartesia use this approach, so the effective cost per character varies by model and plan tier.

Token-based pricing: OpenAI's newer TTS endpoints charge separately for input text tokens and audio output tokens. This makes budgeting less intuitive, since audio token counts do not map directly to character counts.

A fourth model — per-minute pricing — applies to real-time voice agents (Deepgram's Voice Agent API at $0.075/min, Cartesia's Line agents at $0.06/min). These are better compared on conversation-minute economics, not character cost.
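To compare per-character and credit-based plans on one axis, it helps to convert both to an effective cost per million characters. The sketch below is a minimal illustration; the plan price, credit pool, and credits-per-character ratio in the usage lines are hypothetical placeholders, not any provider's published numbers.

```python
def per_character_cost(rate_per_1k_chars: float) -> float:
    """Convert a rate quoted per 1,000 characters to USD per million characters."""
    return rate_per_1k_chars * 1_000


def subscription_cost(plan_price: float, credits_included: int,
                      credits_per_char: float) -> float:
    """Effective USD per million characters on a credit-based plan.

    Assumes the whole monthly credit pool is consumed; unused credits
    push the effective rate higher.
    """
    chars_covered = credits_included / credits_per_char
    return plan_price / chars_covered * 1_000_000


# Hypothetical inputs, for illustration only.
print(f"${per_character_cost(0.030):.2f} per million characters")
print(f"${subscription_cost(99.0, 1_000_000, 1.0):.2f} per million characters")
print(f"${subscription_cost(99.0, 1_000_000, 2.0):.2f} per million characters")
```

The last two lines show why the same plan can have very different economics: a model that burns twice the credits per character doubles the effective rate.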

TTS API Pricing Comparison: 2026

Provider	Model	Price	Pricing Model
AWS Polly	Standard voices	$4/million chars	Per-character
xAI	Grok TTS	$4.20/million chars	Per-character
Deepgram	Aura-1	$15/million chars	Per-character
AWS Polly	Neural voices	$16/million chars	Per-character
OpenAI	TTS-1-HD	~$30/million chars	Per-character
Deepgram	Aura-2	$30/million chars	Per-character
Cartesia	Sonic-3.5 (Pro)	~$42/million chars	Subscription credits
Google Cloud	Neural2 / Studio	$60/million chars	Per-character
ElevenLabs	Flash / Turbo models	~$80/million chars	Subscription credits
ElevenLabs	Standard / V2 models	~$165/million chars	Subscription credits

All figures reflect pay-as-you-go or base plan rates. Volume discounts apply at enterprise tiers. Character-to-credit conversion rates vary by model within subscription platforms.

Key observations:

The cheapest options (AWS Polly Standard at $4/million, Grok TTS at $4.20/million) use older or lower-benchmark models. AWS Polly Standard voices score significantly below neural alternatives on naturalness benchmarks.
Deepgram Aura-2 at $30/million chars offers a strong quality-to-cost ratio for real-time voice agent use cases, with 90ms latency.
ElevenLabs pricing depends on which model you use. The Flash 2.5 model costs roughly half the credits of the standard V2 model — but targets low-latency agent use cases, not studio-grade narration.
Google Cloud TTS Studio voices run $60/million but include a 1M character/month free tier, covering small-scale deployments entirely.

What You Actually Pay at Scale

The per-character rate is your starting point, not your final cost. Here is what each published rate works out to at 1M, 10M, and 100M characters, assuming zero retakes and no volume discounts:

Provider	Rate	1M chars	10M chars	100M chars
Deepgram Aura-1	$15/M	$15	$150	$1,500
OpenAI TTS-1-HD	$30/M	$30	$300	$3,000
Deepgram Aura-2	$30/M	$30	$300	$3,000
Cartesia Sonic-3.5	~$42/M	$42	$420	$4,200
Google Cloud Neural2	$60/M	$60	$600	$6,000
ElevenLabs Flash	~$80/M	$80	$800	$8,000
ElevenLabs Standard	~$165/M	$165	$1,650	$16,500

These numbers assume 100% first-pass acceptance. Production teams never see those conditions.
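The table is straight multiplication, so it is easy to rerun with a retake rate included. The sketch below uses the rates from the table; the `retake_rate` parameter and the simplification that every rejected clip is regenerated exactly once (and that clips are of similar length) are assumptions made for illustration.

```python
# Pay-as-you-go rates from the comparison table, in USD per million characters.
RATES_PER_MILLION = {
    "Deepgram Aura-1": 15,
    "OpenAI TTS-1-HD": 30,
    "Deepgram Aura-2": 30,
    "Cartesia Sonic-3.5": 42,
    "Google Cloud Neural2": 60,
    "ElevenLabs Flash": 80,
    "ElevenLabs Standard": 165,
}


def api_cost(rate_per_million: float, characters: int,
             retake_rate: float = 0.0) -> float:
    """API spend in USD, assuming each rejected clip is regenerated once."""
    billed_characters = characters * (1 + retake_rate)
    return rate_per_million * billed_characters / 1_000_000


for provider, rate in RATES_PER_MILLION.items():
    first_pass = api_cost(rate, 10_000_000)
    with_retakes = api_cost(rate, 10_000_000, retake_rate=0.05)
    print(f"{provider}: ${first_pass:,.2f} first pass, "
          f"${with_retakes:,.2f} at 5% retakes")
```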

The Hidden Costs Nobody Quotes

The per-character rate answers one question: how much does it cost to generate a file? It does not answer the question your finance team eventually asks — how much did it cost to ship a file?

Retake economics. A 2–5% retake rate on 50,000 clips means 1,000–2,500 regenerations. Without automated quality scoring, every retake is manual: someone listens, rejects, re-submits. At even $0.50 per clip in labor, that is $500–$1,250 in unbudgeted QA cost on top of the API charges for re-runs. For a production-grade framework covering these dimensions, see Onepin's TTS quality validation checklist.
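The retake arithmetic above can be reproduced in a few lines. This is a minimal sketch using the figures from the paragraph; the function name and signature are illustrative.

```python
def retake_labor_cost(clips: int, retake_rate: float,
                      labor_per_retake: float) -> tuple[int, float]:
    """Return (number of regenerations, manual QA labor cost in USD)."""
    retakes = round(clips * retake_rate)
    return retakes, retakes * labor_per_retake


# 50,000 clips at $0.50 of review labor per retake.
for rate in (0.02, 0.05):
    retakes, labor = retake_labor_cost(50_000, rate, 0.50)
    print(f"{rate:.0%} retake rate: {retakes:,} regenerations, ${labor:,.2f} labor")
```

This covers labor only; the API charges for the re-runs come on top.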

Mispronunciation failure rates. Brand names, product codes, and proper nouns fail at disproportionate rates. A 2% mispronunciation rate across a 10,000-clip audiobook production means 200 clips require review and re-generation. The API bill for the retakes is minor; the discovery and triage labor is not.

Format conversion. Not every TTS provider outputs the format your delivery target requires. Telephony pipelines typically need 8kHz G.711 audio. Video delivery typically calls for 44.1kHz or 48kHz PCM. If your provider outputs MP3 at 22kHz and you need 44.1kHz WAV, you add a conversion step and an engineering dependency to every output. Higher-fidelity formats can also be gated by plan: ElevenLabs restricts its highest-quality output formats (192kbps MP3 and 44.1kHz PCM) to higher plan tiers, with 44.1kHz PCM available only on Pro and above.
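One common way to handle the conversion step is to shell out to ffmpeg. The sketch below assumes ffmpeg is installed and on the PATH; the profile names and file paths are hypothetical, while the flags (`-ar`, `-ac`, `-c:a`) and the `pcm_mulaw` and `pcm_s16le` codecs are standard ffmpeg options.

```python
import subprocess

# Target profiles. The keys are illustrative labels chosen for this example.
PROFILES = {
    # 8 kHz mono G.711 mu-law, as used in many telephony pipelines.
    "telephony_g711_ulaw": ["-ar", "8000", "-ac", "1", "-c:a", "pcm_mulaw"],
    # 44.1 kHz 16-bit PCM, written to a WAV container.
    "wav_44k_pcm16": ["-ar", "44100", "-c:a", "pcm_s16le"],
}


def build_ffmpeg_command(src: str, dst: str, profile: str) -> list[str]:
    """Build the ffmpeg argument list for a given delivery profile."""
    return ["ffmpeg", "-y", "-i", src, *PROFILES[profile], dst]


def convert(src: str, dst: str, profile: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst, profile), check=True)


print(build_ffmpeg_command("clip.mp3", "clip.wav", "wav_44k_pcm16"))
```

Upsampling 22kHz audio to 44.1kHz satisfies a delivery spec but cannot restore detail the provider never generated, so it is worth checking the native output rate before committing to a provider.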

Model version drift. TTS models update silently. A voice profile validated against ElevenLabs V2 in Q1 may sound subtly different on V2.5 in Q3. Without version-locking, your validated reference profiles drift. Re-validation at scale is a project, not a task.
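A lightweight form of version-locking is to store the exact model identifier alongside each validated profile and flag any profile whose provider is now serving something else. The sketch below is hypothetical throughout: the `ValidatedProfile` type, the field names, and the version strings are illustrative, and it assumes you can learn the currently served model version from your own configuration or from the provider's response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedProfile:
    voice_id: str
    model_version: str  # exact model identifier used during validation


def profiles_to_revalidate(profiles: list[ValidatedProfile],
                           served_version: str) -> list[ValidatedProfile]:
    """Return profiles validated against a model other than the one now served."""
    return [p for p in profiles if p.model_version != served_version]


# Hypothetical data for illustration.
profiles = [
    ValidatedProfile("narrator-en", "v2"),
    ValidatedProfile("host-es", "v2.5"),
]
stale = profiles_to_revalidate(profiles, served_version="v2.5")
print([p.voice_id for p in stale])
```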

Provider lock-in migration cost. The moment you switch providers — because a competitor undercuts on price, releases a better model, or changes pricing structure — you rebuild your entire integration. Validated voice profiles, SSML syntax, webhook logic, and audio delivery pipelines are all provider-specific. At 10,000 validated clips, a migration is a months-long project.

Choosing the Right TTS Provider for Your Volume

Volume and use case determine which rate matters:

Under 1M characters/month: Google Cloud's free tier covers you. For anything requiring higher quality, Deepgram Aura-1 or Cartesia's Pro plan offers strong economics at low volume.

1–50M characters/month: Deepgram Aura-2 ($30/M) or OpenAI TTS-1-HD ($30/M) offer the clearest cost visibility. ElevenLabs Flash works for agent applications where the lower credit cost matters more than maximum expressiveness.

Over 50M characters/month: Enterprise pricing from most major providers typically undercuts published rates substantially. AWS Polly Neural at negotiated volume rates is a common path for high-volume batch workloads. Google Cloud committed-use discounts apply at scale.

Multilingual production: The economics shift when you need dialect-specific voices. Some providers charge the same rate regardless of language; others charge premiums for specific locale models. For a detailed breakdown of multilingual pipeline construction, see the multilingual TTS pipeline developer guide.
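The volume tiers above can be sanity-checked by including free allowances in the calculation. The sketch below uses this article's figures for Google Cloud ($60/M with a 1M character/month free tier) and Deepgram Aura-1 ($15/M); the function itself is an illustrative helper, and it assumes the free allowance is simply subtracted from monthly usage.

```python
def monthly_cost(characters: int, rate_per_million: float,
                 free_characters: int = 0) -> float:
    """Monthly API spend in USD after subtracting a free character allowance."""
    billable = max(0, characters - free_characters)
    return rate_per_million * billable / 1_000_000


for volume in (800_000, 5_000_000):
    google = monthly_cost(volume, 60, free_characters=1_000_000)
    aura_1 = monthly_cost(volume, 15)
    print(f"{volume:,} chars: Google ${google:,.2f}, Aura-1 ${aura_1:,.2f}")
```

Under these figures the free tier wins below 1M characters, but the ordering flips quickly once usage passes the allowance.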

Why Production Teams Route Across Multiple Models

No single model wins across all use cases on price and quality simultaneously.

A podcast network producing 20 shows does not want one voice for every host. A localization team processing 12 languages needs the best-performing model per locale, not the cheapest single provider. A voice agent developer needs sub-100ms latency for real-time conversations but studio-grade output for IVR greetings and on-hold prompts. That one team needs two different models and two different pricing structures, managed in one pipeline.

The answer is routing: selecting the right model for each job, validating the output, and shipping only clips that pass quality thresholds. That is what TTS orchestration means in practice.
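As a rough illustration of the routing idea, the sketch below picks the cheapest model that meets a job's latency and quality requirements, then falls back to the next candidate if validation fails. It is not a description of any product's implementation: the catalog entries, quality scores, and the `synthesize` and `passes_validation` callables are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Model:
    name: str
    price_per_million: float  # USD per million characters
    latency_ms: int
    quality: float  # hypothetical 0-1 score from your own evaluation


# Hypothetical catalog for illustration.
CATALOG = [
    Model("budget-voice", 4.0, 300, 0.50),
    Model("agent-voice-a", 30.0, 90, 0.75),
    Model("agent-voice-b", 80.0, 75, 0.80),
    Model("studio-voice", 165.0, 400, 0.95),
]


def route(catalog: list[Model], max_latency_ms: int,
          min_quality: float) -> list[Model]:
    """Eligible models for a job, cheapest first."""
    eligible = [m for m in catalog
                if m.latency_ms <= max_latency_ms and m.quality >= min_quality]
    return sorted(eligible, key=lambda m: m.price_per_million)


def synthesize_with_fallback(
    candidates: list[Model],
    synthesize: Callable[[Model], bytes],
    passes_validation: Callable[[bytes], bool],
) -> Optional[tuple[str, bytes]]:
    """Try candidates in order; return the first output that passes validation."""
    for model in candidates:
        audio = synthesize(model)
        if passes_validation(audio):
            return model.name, audio
    return None  # every candidate failed; escalate to manual review


agent_candidates = route(CATALOG, max_latency_ms=100, min_quality=0.7)
print([m.name for m in agent_candidates])
```

Ordering by price only after filtering on requirements keeps cost as a tiebreaker rather than the deciding factor.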

Onepin sits above the model layer. It routes jobs to the appropriate TTS provider based on quality requirements, language, latency targets, and cost constraints — then validates each output before it ships. When a model fails, Onepin retries with an alternative. When a model updates silently, Onepin catches the drift before it reaches your audience.

The per-character rate is a starting point. The total cost of production — including retakes, validation, format handling, and migration risk — is what Onepin is built to reduce.

Ready to stop managing TTS providers and start shipping audio? Explore Onepin.