An Indian AI startup has quietly pulled off something most voice-tech companies struggle to do: win over listeners without the power of a global brand.
This week, Sarvam AI revealed that its latest text-to-speech (TTS) model, Bulbul V3, outperformed well-known global competitors in a large blind listening test, an outcome that is already drawing attention from developers and speech-tech insiders.
A test where branding didn’t matter
The evaluation, conducted with Josh Talks, asked more than 500 participants to compare short audio clips generated by different TTS models. Crucially, listeners weren’t told which company made which voice.
Bulbul V3 consistently came out on top. In telephony-style audio—arguably one of the hardest environments for synthetic speech—the model achieved listener preference rates as high as 77.95%. Participants also flagged fewer pronunciation mistakes and fewer skipped words compared with samples from ElevenLabs and Cartesia.
For a startup competing against heavily funded voice platforms, that result carries weight.
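A preference rate of this kind is typically the share of blind votes a model wins. The sketch below is a minimal illustration, assuming each listener response is stored as the anonymized label of the preferred clip. The function name, the labels, and the votes are hypothetical and do not reflect the actual test data or methodology.

```python
from collections import Counter

def preference_rates(votes):
    """Return each label's share of blind votes, as a percentage."""
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 2) for label, n in counts.items()}

# Hypothetical votes: listeners only ever see the labels "A", "B", "C",
# never the name of the company behind each voice.
votes = ["A", "A", "B", "A", "C", "A", "B", "A"]
rates = preference_rates(votes)
print(rates)
```

Keeping the labels anonymous until after the tally is what makes the comparison blind: the mapping from label to vendor is applied only once the percentages are computed.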
Built for how people actually speak
Sarvam AI says Bulbul V3’s edge comes from focusing less on studio-perfect demos and more on everyday speech. That includes code-mixed language like Hinglish, rapid number sequences, and proper nouns—details that often trip up speech models in customer support and education settings.
Co-founder Pratyush Kumar has pointed to these real-world scenarios as the company’s north star. Call centers, interactive voice response (IVR) systems, and edtech platforms don’t need poetic voices, he argues; they need clarity, consistency, and trust.
Listeners in the blind test seemed to agree.
Why telephony matters more than demos
Many modern TTS systems sound impressive in clean, high-bitrate audio. Things change when compression, noise, and low bandwidth enter the picture.
That’s where Bulbul V3 appears to have separated itself. Test participants repeatedly favored its output in phone-quality audio, suggesting the model was trained—or at least tuned—with these constraints in mind.
For enterprises running large-scale voice systems, that’s not a niche win. It’s often the deciding factor.
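To make the phone-quality constraint concrete, the following sketch degrades a clean signal roughly the way a phone line would: it halves the sample rate from 16 kHz to 8 kHz and squeezes each sample into 8 bits with mu-law companding. It is an illustration under stated assumptions, not the procedure used in the test. The function names are hypothetical, the pair-averaging step is only a crude stand-in for a proper low-pass filter, and the continuous mu-law formula is used rather than a bit-exact G.711 codec.

```python
import math

MU = 255  # companding constant used by 8-bit mu-law telephony

def mu_law_encode(x):
    """Compress a sample in [-1, 1] to an 8-bit code (0..255)."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * MU))

def mu_law_decode(code):
    """Expand an 8-bit code back to a sample in roughly [-1, 1]."""
    y = 2 * code / MU - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def to_phone_quality(samples_16k):
    """Halve the sample rate by averaging pairs, then mu-law round-trip."""
    halved = [(a + b) / 2 for a, b in zip(samples_16k[::2], samples_16k[1::2])]
    return [mu_law_decode(mu_law_encode(s)) for s in halved]

# 10 ms of a 440 Hz tone at 16 kHz (160 samples), used as a stand-in
# for synthesized speech.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
phone = to_phone_quality(tone)
print(len(tone), "->", len(phone), "samples")
```

Running synthesized speech through a degradation step like this before a listening test is one way to check whether a voice that sounds clean at full quality stays intelligible over a narrowband channel.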
Developers are already experimenting
Sarvam AI has made Bulbul V3 available for free testing through its dashboard, a move that’s helped the model spread quickly among developers.
Early feedback highlights how “human” the voices sound over longer passages and how reliably the model handles mixed-language text. Although this is not yet a full market launch, the interest suggests developers are evaluating Bulbul V3 for production use rather than trying it out of curiosity.
A signal beyond one model
The bigger story isn’t just that Bulbul V3 won a blind test. It’s what that win represents.
India is one of the most linguistically complex markets in the world, and models that perform well here may prove resilient elsewhere. If a relatively young startup can outperform established players on listener preference, arguably the metric that matters most, it challenges long-held assumptions about where innovation in AI voice comes from.
It also raises an uncomfortable question for global TTS leaders: are benchmarks and brand recognition starting to matter less than real-world listening tests?
What comes next
Sarvam AI hasn’t disclosed pricing or enterprise partnerships yet, but the company says Bulbul V3 is only the beginning. Expect broader language coverage, more domain-specific voices, and tighter integrations with business platforms.
For now, Bulbul V3’s blind-test win stands as a reminder that in AI voice, the final judge isn’t a leaderboard—it’s the human ear.
And this time, it chose a new name.