The new race to get AI to talk back to you
Commercializing text-to-speech is now a duel between OpenAI and startups—and a fresh unicorn is in the mix.
A new race in the AI industry is quietly heating up in tandem with the jockeying among OpenAI, Google, and others to make the Biggest Model Of All: getting our phones to talk back to us.
While practitioners and hobbyists dissect and develop every possible permutation of existing models, text-to-speech, which gives that LLM a real voice, has been a quietly expensive and formidably challenging tool to build. OpenAI made an aggressive move into it with the launch of its text-to-speech product in November. And several startups have recently emerged to tackle it directly as an independent product, including Play.ht and ElevenLabs.
The latter, ElevenLabs, announced its long-expected funding round today, raising $80 million in a round co-led by Andreessen Horowitz and prolific AI investors Nat Friedman and Daniel Gross. The round also includes Sequoia, Smash Capital, SV Angel, BroadLight Capital and Credo Ventures. It values the company at more than $1 billion, confirming my earlier report on it as OpenAI made its first push into text-to-speech.
ElevenLabs launched in 2022, but the march toward making text-to-speech more widely accessible kicked off in earnest in November last year with the release of OpenAI’s text-to-speech AI. And while OpenAI is dealing with LLM developers, image generation challengers like Midjourney, and better embeddings providers like Voyage and Cohere, it’s now opened itself up on a fourth front for challenger startups in text-to-speech.
Commercially viable text-to-speech tools are still in their relative infancy. In very OpenAI fashion, the company created a comically simple API that’s cheap for a lot of use cases but doesn’t include features like voice cloning. ElevenLabs, by contrast, is building out a whole operation around figuring out what TTS should look like within AI products in the first place.
“OpenAI is one of the biggest rivals, sure, because they have the tech capability to deliver that to an already huge base of developers for commercial work,” ElevenLabs co-founder Mati Staniszewski told me. “But we want to focus on audio as a whole—on research and product. We develop our own research, with quality at the forefront, and we focus on how to make it controllable and build our products around it. OpenAI is focused on delivering that audio experience as part of the chat experience, which is a very specific use case.”
Text-to-speech has quietly established itself as not only viable, but actually commercializable. And while the use cases are largely sequestered in the realm of entertainment for now, that class of use cases is only growing over time. Examples include users converting what they say into different languages, as well as emerging use cases for the vision-impaired community.
There’s also the obvious use case: replicating the AI assistant we’ve seen in sci-fi for decades but never expected to drop onto our phones overnight, a few weeks before we all broke for the Thanksgiving holidays.
And then there’s the last part of the whole text-to-speech component: what happens to these tools as we move closer to a reality where models exist locally on devices, and what do those experiences actually look like? And it turns out Staniszewski has that in mind, too.
Different approaches to text-to-speech
ElevenLabs’ business is essentially about finding a balance, Staniszewski tells me. The startup’s challenge is finding an optimal point between sound quality, latency, availability, and of course, price. For now, and this isn’t restricted to text-to-speech, each of those comes at the expense of the others.
“As the model gets bigger [for better fidelity], the latency you can deliver is not as good any more,” Staniszewski said. “There’s the tradeoff for quality versus speed. We could deliver a small model very quick, and where does that threshold lie where users say they enjoy speaking to it and listening to it. And as we continually add more models, you need the GPUs to cater to all the users. There’s a benefit given our scale, we can effectively serve users and the peaks are covered. But different users will use different models, so you could have some constraints on GPUs.”
That challenge has made text-to-speech difficult to grow to the epic scale of a ChatGPT. In November, however, OpenAI announced that it would both launch a text-to-speech API and add it as a feature for all existing ChatGPT users. It’s a pretty classic OpenAI strategy: tease developers with the consumer app, then try to crush everyone on price after you’ve won them over.
OpenAI is priced aggressively against both Play.ht and ElevenLabs. But with OpenAI, it’s important to remember that there are usage limits for non-enterprise customers. OpenAI’s top tier, which requires customers to spend $1,000 per month on APIs, allows a maximum spend of $10,000 per month. This effectively caps usage of its text-to-speech product, which could also be running concurrently with its GPT-series models and eating into the same budget. The way around the cap is an enterprise plan, for which OpenAI’s pricing pages don’t disclose any limits.
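To get a feel for what a spending cap like that means in practice, here is a back-of-the-envelope sketch in Python. The $10,000 monthly cap is the figure cited above; the per-million-character price and the amount already spent on other APIs are hypothetical placeholders rather than any vendor's actual rates, and the function name is purely illustrative.

```python
def max_monthly_characters(spend_cap_usd, price_per_million_chars_usd,
                           other_api_spend_usd=0.0):
    """Estimate how many characters of speech fit under a monthly spend cap.

    All prices passed in are assumptions supplied by the caller; this only
    does the arithmetic of dividing the remaining budget by a unit price.
    """
    if price_per_million_chars_usd <= 0:
        raise ValueError("price per million characters must be positive")
    # Budget left for text-to-speech after other API usage (e.g. LLM calls).
    remaining_usd = max(spend_cap_usd - other_api_spend_usd, 0.0)
    return int(remaining_usd / price_per_million_chars_usd * 1_000_000)


if __name__ == "__main__":
    CAP_USD = 10_000            # monthly cap cited in the article
    HYPOTHETICAL_PRICE = 10.0   # placeholder: USD per 1M characters

    # Whole budget spent on speech, versus sharing it with other API usage.
    print(max_monthly_characters(CAP_USD, HYPOTHETICAL_PRICE))
    print(max_monthly_characters(CAP_USD, HYPOTHETICAL_PRICE,
                                 other_api_spend_usd=4_000))
```

The point of the third parameter is that the cap applies to total API spend, so every dollar going to GPT-series calls shrinks the room left for speech.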
When you look at the head-to-head price comparisons (we’re looking at the highest self-serve paid tiers and not custom enterprise plans, just to try to get to a ballpark here), OpenAI costs a fraction of what Play.ht and ElevenLabs charge. But it also offers what you could argue is the bare-minimum API product here, against the backdrop of the aggressive demand OpenAI faces broadly for its products.