AI in February: Data firehoses get a price tag, Google face-plants, and eyes turn to chunking
Plus: Mistral's new model and that Groq thing.
Update 3/8: The flu is currently ripping through the household, so issues will resume next week. Paid subscribers will receive another comped week.
Happy March! I hope we’re all very excited for daylight saving time to start, now that the leap day has, as usual, broken a bunch of crap.
The first quarter of the year has already been alarmingly packed with announcements, launches, and ongoing shouting over benchmarks. In particular, we’re starting to see increasing momentum in launches and announcements from OpenAI and Google, though seemingly in opposite directions.
Honestly, this tweet (sorry, post) from Runway CEO Cristóbal Valenzuela really summed everything up:
Send help, and coffee, please.
Now that I’m back from leave, we can get to addressing all the craziness that happened this month and brace for what’s coming next. So, let’s talk about some of the bigger stories in February, as well as what I’m starting to see on the horizon:
OpenAI and Google do some stuff: OpenAI is breaking into yet another modality with the release-ish of its text-to-video model, Sora. Meanwhile, Google continues to not figure its stuff out with Gemini, much to the benefit of others—including Perplexity. However…
Google makes the first move on open source: Google also releases two pocket models, Gemma, in an overture to the open source community that its closed-model peers have yet to make. Maybe this and the item above would have landed better in the reverse order, but we’ll take what we can get.
Data sets get price tags: After companies quickly shut off access to data firehoses for training, it was really only a matter of time before they reopened them with a toll fee attached. We’re starting to see that emerge with the launch of a handful of APIs and deals.
Unstructured data gets a closer look: Now that retrieval augmented generation (RAG) is becoming the standard and embeddings models are improving, some are starting to look more closely at all the other parts on the left side of a vector database pipeline. That includes two important steps: pre-processing and chunking (a rough sketch of both follows below).
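To make that last bullet a bit more concrete, here’s a rough sketch of what pre-processing and chunking can look like in a RAG pipeline. This is plain Python with made-up function names and parameters rather than any particular library’s API, and fixed-size chunks with overlap is just one of several common chunking strategies:

```python
# A rough, hypothetical sketch of pre-processing and chunking for a RAG pipeline.
# Plain Python, no particular framework; function names and defaults are made up.

def preprocess(raw: str) -> str:
    """Light pre-processing: collapse whitespace so chunk sizes stay predictable."""
    return " ".join(raw.split())

def chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split cleaned text into fixed-size character chunks that overlap slightly,
    so an idea isn't cut cleanly in half at a chunk boundary."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

if __name__ == "__main__":
    doc = "Some very long unstructured document text. " * 200  # stand-in source
    pieces = chunk(preprocess(doc))
    print(f"{len(pieces)} chunks ready to hand to an embedding model")
```

The overlap is the design choice worth noting: it trades a bit of redundancy for not slicing an idea in half at a chunk boundary, which matters because the embeddings model only ever sees one chunk at a time.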
With that, let’s get started!
Text-to-video has its ElevenLabs moment
OpenAI and Google once again blew up the internet this month with their work—some of it for good reasons, and some for… not great reasons. And once again it highlights a lot of opportunities for a handful of startups that are first-movers in the space.
In the spirit of not being total downers about the incumbents in AI, let’s start with the good.
OpenAI finally unveiled its work on its text-to-video model on, naturally, the platform-formerly-known-as-twitter. There, Altman and company posted some impressive-looking, semi-long clips, lighting up the platform that… has already seen many impressive-looking text-to-video clips developed by startups.
Those startups would be Runway and Pika. The former, backed by Lux Capital, Felicis Ventures, Coatue, Google, and Nvidia, already has a live text-to-video product that you don’t have to tweet at Sam Altman to use. Pika, meanwhile, is backed by Elad Gil (who, once again, is everywhere), Lightspeed Venture Partners, and Sequoia. You can, again, use Pika right now (which, funnily enough, uses the same Discord login as Midjourney).
OpenAI is once again a late mover in this space, as it was with text-to-speech, but the product, it turns out, is quite good and of the pedigree we’ve come to expect from OpenAI. But without getting our hands on it, we don’t know exactly what we’re getting.
In the text-to-speech example, OpenAI’s product is pretty bare-bones, but it’s also cheap. Its main challenger startup, ElevenLabs, though, has a much more robust product built around its text-to-speech tool. Even if it’s more expensive, ElevenLabs was still able to raise at more than a $1 billion valuation. Runway, meanwhile, was last valued at $1.5 billion.
In that way, we’ve lately started to see OpenAI take on a kind of Apple-like veneer—it’s usually not the first to market, but it comes out with something best-in-class. (And, funnily enough, longtime Apple alumnus Steve Dowling has a stint at OpenAI on his LinkedIn.)
What remains to be seen is the pricing model involved here and whether it will hew to OpenAI’s typical strategy of trying to win on price by simply undercutting everyone. That strategy is starting to show some wear and tear, though, most recently with the launch of its new, price-undercutting embeddings models, which were themselves undercut within 24 hours.
Google’s the-good-then-the-bad month
Continuing on with the good-ish, Google released two open source pocket models, one coming in at 2 billion parameters and the other at a Mistral-ish 7 billion. Again, fine, another open source model where we can all argue about benchmarks and whatnot. But that’s not the point.
Google, though it seems like it can’t get out of its own way with Gemini, is the first of the major foundation model providers to release an open source model. That in general is a surprising overture to the AI community, but if any of them was going to do it, it would have been Google—it did the same thing with TensorFlow and BERT.
There’s been an expectation among the community for a long time that OpenAI would be the first to move here, to try to build up some goodwill with the open source community that has run off with models like Mixtral and Llama 2 to create models highly competitive with GPT-3.5 Turbo. But Google beat them to the punch, which is a rare win so far in its AI efforts.
That, though, has been overshadowed by yet another foot-in-the-mouth moment with its Gemini models. Google CEO Sundar Pichai had to send out a memo saying there would be “structural changes,” among other corporate-speak for the botched product launches. (Semafor got the memo if you want the full read.)
What this increasingly highlights, though, is that there is an upper bound to red teaming and testing for any model, even at the scale of Google. Nearly immediately after ChatGPT launched eons ago (November 2022), an enterprising prompt engineer broke through the guardrails.
It raises the question of whether these companies should be setting up white-hat “prompt bounty” programs for this kind of prompt hacking. Even though these products are nightmarishly expensive to run at scale, it feels really unlikely that any product will emerge without some crack to exploit. And internal red teaming carries the same kinds of blind spots and corporate cultural baggage as any other product launch.
Instead of getting a lot of praise for some of the breakthroughs of its Gemini models—particularly a 1 million token context window—Pichai is once again on the defensive. And you have to wonder how patient people will be with the CEO of a company known for effectively sparking modern machine learning with TensorFlow when it’s constantly in the news for the wrong reasons on AI.
Toll fees for data sets
One of the biggest emerging blockers to genuinely useful applications of AI is having the high-quality data sets that unlock that value in the first place. That can take the form of data used for retrieval through RAG, or of data sets used for fine-tuning or creating adapters for customization.
Before the release of ChatGPT, OpenAI essentially got a free pass to use any data that was freely available on the internet—particularly from some of the most popular platforms like Stack Overflow and Reddit. That data was a crucial input into its models, imbuing them with code generation capabilities, along with a better feel for aspects like tone and voice.
That free pass has essentially collapsed, with many of those popular and crucial data sets shutting their doors and/or going after OpenAI directly. Now the platforms are finally reopening them and asking OpenAI—and others—to pay up. And we’re once again getting closer to seeing what a high-quality data set is actually worth.