AI in April (and Q2): RPA in focus, holistic evaluations, and eyes back on Datadog
Plus: OpenAI and Google are doing some stuff next week.
This issue is going to be a bit compressed and cover multiple topics as I’m recovering from a hand injury and a minor illness. I apologize in advance for any typos and whatnot.
Next week Google and OpenAI are both going to be making announcements on their AI products, with OpenAI sliding just ahead of Google with a livestream on Monday.
Sam Altman shut down speculation on Twitter that this is an announcement of a Perplexity competitor or GPT-5. OpenAI says the announcements will be updates for ChatGPT and GPT-4, landing right before Google’s developer conference kicks off. And we’ll probably get another whole new list of startups OpenAI and Google might-but-might-not-but-possibly smush in the process.
OpenAI obviously has to contend with Google and its continued improvements, as well as Meta’s largest Llama 3 model (400B+ parameters) that’s still in the works. But maybe we’ll finally see a response to all the pressure OpenAI has faced on its workhorse GPT-3.5 Turbo model, particularly with the release of the Llama 3 models. Llama 3 70B is both cheaper and more performant than GPT-3.5 Turbo on standard benchmarks. Cost and simplicity have long been one of OpenAI’s advantages, but that’s slipping away as more startups offer drop-in replacements for GPT-3.5 Turbo with an API powered by Llama 3.
What that would look like is unclear, but it wouldn’t be all that surprising if OpenAI came out with some kind of lightweight GPT-4 or upgraded workhorse model alongside whatever else it has planned. (LMSys’s Chatbot Arena, a hub where users compare results from competing models and rate the better one, has been exploding with speculation over a series of cheekily named models that seem to be from OpenAI.)
Now, with that speculation out of the way, let’s get to the actually fun stuff for the quarter so far: what everyone’s arguing about in AI lately. (I’ll be recapping conversations here from April up until today, even though this is technically supposed to be focused on April.)
The whole retrieval augmented generation (or RAG) pipeline is coalescing into a rather sensible (if not sprawling) chain of startups and tools. And, as usual, language models are getting closer to production—though the timelines still range from a few quarters to more than a year, depending on who you talk to.
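For readers newer to the space, the core RAG loop those startups are chaining together is simple: retrieve the documents most relevant to a question, then stuff them into the prompt. Here’s a toy sketch of that loop; the keyword-overlap scorer is a deliberately naive stand-in for a real embedding index, and `call_model` (not shown) would be whatever LLM endpoint you use:

```python
# Toy retrieval-augmented generation (RAG) loop. The "retriever" is a naive
# keyword-overlap scorer standing in for a real embedding index.

def score(query: str, doc: str) -> int:
    """Count how many words the document shares with the query (toy relevance)."""
    query_words = set(query.lower().split())
    return sum(1 for word in doc.lower().split() if word in query_words)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents that best match the query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Stuff the retrieved context into the prompt ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"
```

Production versions swap the scorer for embeddings in a vector store and add chunking, reranking, and citation of sources, but the shape of the pipeline is the same.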
From everyone I’ve spoken with lately, the key word this month (and the quarter) seems to be practicality. Not in the sense of whether a product built on a language model can or can’t be done—but whether it should be done in the first place. Experiments are giving way to analysis around cost, evaluation, and the simplest places to start to extract value.
So, with that said, here’s what everyone is talking about as we head further into the second quarter of the year:
RPA gets a closer look. Robotic process automation—basically the work of automatically clicking around on a website—is increasingly seen as a spot ripe for disruption by language models. In particular, RPA is seen as an initial stepping stone toward a much broader automation network.
Long context versus RAG. There’s universal agreement that jamming more relevant information into a prompt yields better results. But now there’s this emerging debate-ish over whether gargantuan context windows or retrieval augmented generation is the way to go.
Evaluations beyond just the benchmarks. While a lot of these larger models try to one-up each other on performance benchmarks, a lot of companies are now struggling to evaluate whether a model actually did the thing they wanted it to do. And it turns out that’s as much an aesthetic review as it is a check of basic performance metrics.
Eyes are back on Datadog for its next move in AI. Whispers are starting to trickle in around Datadog’s intent to enter AI pipelines more formally this year. Like MongoDB with vector search, investors and users are waiting to see what Datadog—which is already deeply entrenched in enterprises—does in AI observability and evaluation.
Let’s dive into each, starting with RPA!
RPA gets a closer look as a primitive agent
One topic that routinely comes up is whether the infrastructure layer—startups like Ollama, LangChain, Chroma, or ElevenLabs—is saturated, and whether focus should shift to the app layer on top of all that infrastructure. There are a lot of natural use cases for AI that everyone tends to point to, like talking to legal documents, code generation, customer service chatbots, and so on and so forth.
Increasingly, though, there’s another broader focal point emerging on that app layer from experts and investors: robotic process automation, or RPA. And while these automations—a fancy way of saying a bot acting like a human in some form to complete a simple task—are currently more discrete, we already have a term for what would come next with substantially more advanced self-orchestrated RPA: agents.
RPA is seen by a lot of companies and investors as a kind of low-hanging fruit that they can quickly capitalize on, either through the use of smaller models or by just calling one of the workhorse endpoints like GPT-3.5 Turbo. Most companies reflexively point to the case study put out by Klarna as an example of what can be accomplished with a language model under the hood just routing requests to specific products, FAQs, or support pages automatically.
Modern RPA has largely been relegated to manual tasks that bots can accomplish autonomously, such as, well, clicking around on a website to figure out if stuff works or how to break an app. The challenge with all that “clicking around” is that minor changes in workflows and products can knock said bot off its designated pathway and render it useless (or outright counterproductive).
This is where language models can potentially plug in. The extreme version is that all app interactions happen through natural language, which you would… automate with a language model. But in the meantime, language models can potentially filter through customer support tickets for escalation and de-escalation, or fiddle around with websites to find ways to break them.
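To make that ticket-filtering idea concrete, here’s a minimal sketch of the routing logic. In practice a language model call would make the escalate-or-not decision; here a keyword heuristic stands in for that model call (the signal words and route names are my own illustrative choices, not from any real product) so the scaffolding is runnable on its own:

```python
# Toy ticket triage: a language model would normally classify each support
# ticket; a keyword heuristic stands in for the model call here.

ESCALATION_SIGNALS = {"refund", "lawyer", "outage", "urgent", "broken"}

def route_ticket(ticket: str) -> str:
    """Return 'escalate' or 'self_serve' for a support ticket."""
    words = set(ticket.lower().replace(",", " ").split())
    if words & ESCALATION_SIGNALS:
        return "escalate"    # hand off to a human agent
    return "self_serve"      # point at FAQ / support pages

tickets = [
    "My payment is broken and I need a refund now",
    "How do I change my profile picture?",
]
routes = [route_ticket(t) for t in tickets]
```

The appeal, per the Klarna-style pitch, is that swapping the heuristic for a model call makes the router resilient to phrasing the keyword list would miss—without changing the surrounding plumbing.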
UiPath is largely the company with a target on its back here. Founded in 2005, UiPath built a business valued at more than $10 billion, and it announced a family of language models in March this year. Meanwhile, Microsoft, purveyor of Copilot buttons, has—shocker—a copilot for its RPA product Power Automate.