You Can Now Describe Any Voice You Want. Your Pipeline Still Can't Guarantee It Ships.

A New Kind of TTS Promise
Indian AI startup Rumik has launched Silk Mulberry 1.5, and the pitch is genuinely different: type a description (a Haryanvi man in his 30s, low pitch, code-switching between Hindi and English) and the model builds a voice to match. The reported numbers are a MOS of 4.23 and sub-200ms TTFA (time to first audio) on a single H100 at 80 concurrent requests, within range of ElevenLabs v3 and Gemini 3.1 Flash TTS. It is a real step forward for voice design.
The Number That Changes Everything at Scale
Silk Mulberry 1.5 scores 74% overall on InstructTTS Eval, which measures how consistently the output matches the voice description. On abstract role-play it drops to 64.4%. In a demo, 74% sounds strong. In production, it means roughly 1 in 4 outputs does not match the voice you specified. Run 10,000 clips, a modest volume for an e-learning company, and you can expect about 2,600 that sound wrong. None of them fail automatically. Cartesia, MiniMax, and every other TTS provider face the same gap between generation quality and specification compliance at scale. The industry has not solved it at the model level.
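The arithmetic behind that estimate is easy to check. The sketch below assumes the benchmark score can be read as a per-clip match rate, which is this article's framing rather than something the benchmark itself guarantees. The function name `expected_mismatches` is hypothetical.

```python
def expected_mismatches(total_clips: int, compliance_rate: float) -> int:
    """Expected number of clips that miss the voice spec.

    Assumes each clip independently matches the spec with probability
    `compliance_rate` (an illustrative simplification).
    """
    if not 0.0 <= compliance_rate <= 1.0:
        raise ValueError("compliance_rate must be between 0 and 1")
    # Round to the nearest whole clip to absorb floating-point noise.
    return round(total_clips * (1.0 - compliance_rate))


# Overall score reported for Silk Mulberry 1.5: 74%
overall = expected_mismatches(10_000, 0.74)

# Abstract role-play score: 64.4%
role_play = expected_mismatches(10_000, 0.644)
```

With the overall score, a 10,000-clip run yields the 2,600 expected mismatches cited above. Plugging in the role-play score instead pushes the figure higher still.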
What Closes the Gap
Production-grade voice delivery requires an output validation layer above the model: per-clip compliance scoring against the spec, automated retry with model routing, model version locking so that silent updates don't shift baselines, and a full audit trail per run. Across 100+ TTS models, Onepin validates every output against your voice specification, routes retries automatically, and prevents non-compliant clips from shipping. The model generates. The pipeline makes sure what ships matches what you asked for. See how Onepin handles voice spec compliance at production scale.