<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Supervised]]></title><description><![CDATA[Covering innovation and emerging technology in big data and AI.]]></description><link>https://www.supervised.news</link><image><url>https://substackcdn.com/image/fetch/$s_!kpIY!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c85142-baa2-4de5-bdd2-d3d2932bbdb2_1200x1200.png</url><title>Supervised</title><link>https://www.supervised.news</link></image><generator>Substack</generator><lastBuildDate>Tue, 28 Apr 2026 20:14:36 GMT</lastBuildDate><atom:link href="https://www.supervised.news/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Matthew Lynley]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[supervised@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[supervised@substack.com]]></itunes:email><itunes:name><![CDATA[Matthew Lynley]]></itunes:name></itunes:owner><itunes:author><![CDATA[Matthew Lynley]]></itunes:author><googleplay:owner><![CDATA[supervised@substack.com]]></googleplay:owner><googleplay:email><![CDATA[supervised@substack.com]]></googleplay:email><googleplay:author><![CDATA[Matthew Lynley]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[What's next for Supervised]]></title><description><![CDATA[I've made the extremely difficult decision to end my full-time job with the newsletter, which will shift to more of a hobby. 
Plus, is the thinning out in the modern data stack finally here?]]></description><link>https://www.supervised.news/p/whats-next-for-supervised</link><guid isPermaLink="false">https://www.supervised.news/p/whats-next-for-supervised</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Fri, 16 May 2025 16:20:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Y9CP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y9CP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y9CP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Y9CP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Y9CP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Y9CP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Y9CP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1173563,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.supervised.news/i/156027303?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Y9CP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Y9CP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Y9CP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Y9CP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e9bf34-aa19-44e2-94b9-22d6c108b965_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">a sleek, friendly pixar-style robot carrying a box of office supplies out of a dim room &#8212; midjourney</figcaption></figure></div><p>If you haven&#8217;t heard already, a few months ago I made the very, very difficult decision to wind down my full-time job writing this newsletter due to a back injury. 
Now that I&#8217;m feeling better post-surgery and eager to participate in all the exciting stuff going on in this space, I&#8217;m going to start writing again &#8212; for free &#8212; until I find a full-time gig.<br><br>If you want my thoughts on the modern data stack, skip ahead. If you want the full story of my newsletter decision and my takeaways from the experience, here it is: </p><p>A lower back injury causing extreme pain left me barely able to write, much less actively try to report&#8212;which involves a lot of meetings, events, and being able to sit for long periods and focus on conversations. I spent six months trying, unsuccessfully, to rehab the injury through physical therapy, acupuncture, and lifestyle changes, and I&#8217;m now writing this on the other end of back surgery.</p><p>Before launching the newsletter, I set several ambitious growth targets&#8212;ones that would make it viable to run a newsletter as a full-time gig in the Bay Area. As a result of the injury, I wasn&#8217;t able to hit them.</p><p>Momentum is <em>everything</em> when you are running your own thing. This newsletter was growing quickly for a year, and then came to a crashing halt once I stopped publishing and turned off billing as I tried to rehab my injury.</p><p>Launching my own project has always been a dream of mine, and I had hoped that if it were to come to an end, it would be because of my own mistakes and errors. Unfortunately, I massively underestimated the role your physical health plays in the actual health&#8212;and survival&#8212;of your product.</p><p>For now, I finally have some reprieve from the pain (and some creative energy) to write for you while I recover. But I&#8217;m now going to have to do it as a hobby while I figure out what to do next. 
(If anyone has ideas, I&#8217;m all ears!)</p><p>Words cannot express my gratitude to those of you who have been paid supporters of Supervised&#8212;especially those who were there from day one. You all made it possible to chase that dream, even if it ended in devastating fashion. I hope I was able to deliver on the value and promises I made when you came along for the ride.</p><p>To everyone who was even a casual reader&#8212;I know your inbox is crowded! Thank you for taking even a little look at what I was working on over here.</p><p>Finally, I want to say this over, and over, and over again for others who are running their own thing (or thinking of starting one): <strong>take care of your health</strong>. I know I did my best work when I was fully healthy.</p><p>Some final logistics to address:</p><ul><li><p>Billing will remain off for the foreseeable future. This also means only existing paid subscribers will be able to see what is below a paywall in older posts.</p></li><li><p>I&#8217;ll continue to write as opportunities arise, though on a much more limited publishing schedule.</p></li><li><p>If you&#8217;re a paid annual subscriber and you are interested in a pro-rated refund, please reach out to me directly at m@supervised.news. I&#8217;ll do my best to process refunds in the coming months.</p></li></ul><p>Again, my deepest thanks to everyone: my readers, my sources, and all the companies and startups that took the time to talk to me&#8212;especially right as I was getting off the ground. As always, you can reach out to me at any point for, well, anything. My line will remain open, and I want to know about your life too! </p><div><hr></div><h2>Early signs of a consolidation of the modern data stack</h2><p><em>(Full disclosure: I spoke with a handful of the companies in the broader modern data stack and MLops stack about potential roles. 
That has no impact whatsoever on my analysis here, which is based on historical reporting and prior newsletters.)</em></p><p>The looming consolidation of the modern data stack has been a constant theme (and somewhat of a joke) for the better part of a decade. Venture money flooded the ecosystem going back to the early 2020s, sending companies that were rewriting parts of the analytics and data science stack to multi-billion dollar valuations. In the same way that AI is <em>incredibly</em> crowded, the modern data stack has been <em>incredibly</em> crowded for a significantly longer period of time. </p><p>But in the last few months, the largest players in the industry have indeed finally started picking off smaller companies that made more sense as part of a larger company than as full-fledged public companies. We&#8217;re starting to see signs of it in <em>both</em> the modern data stack and the machine learning ops stack, which you could even argue are kind of interchangeable by now.</p><p>While not exactly part of that consolidation, my general observation here is that this kicked off in earnest with <a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs">Dbt Labs acquiring SDF</a> in January this year, which in some circles was considered a potential challenger to parts of Dbt as a kind of pre-compiled SQL tool. 
And it also wasn&#8217;t solely interesting for being in the category of &#8220;what if X, but Rust?&#8221; (Tobiko is also in this conversation, <a href="https://www.tobikodata.com/blog/the-future-of-tobiko---from-sinks-to-sql">which in June last year raised $21.8 million</a> from Theory Ventures and Unusual Ventures&#8212;as well as MotherDuck CEO Jordan Tigani, Fivetran CEO George Fraser, and Census CEO Boris Jabes.)</p><p>Since then <a href="https://wandb.ai/site/">Weights &amp; Biases</a>, basically the main option in model experiment and evaluation&#8212;<a href="https://wandb.ai/site/llm-monitoring-sign-up/">and in some ways now observability</a>&#8212;<a href="https://www.coreweave.com/blog/coreweave-and-weights-biases-to-join-forces">was acquired by CoreWeave</a>. The Information reported that <a href="https://www.theinformation.com/articles/coreweave-in-talks-to-buy-ai-startup-weights-biases-for-around-1-7-billion?rc=u5devd">earlier talks between the companies landed around $1.7 billion</a>. (I <a href="https://www.supervised.news/p/the-second-act-of-the-pre-gpt-ai">originally reported that Weights &amp; Biases held talks for a $2 billion funding round</a> around the beginning of March 2023, and <a href="https://www.supervised.news/p/vibes-based-evals-and-weights-and">that Weights &amp; Biases had passed $30 million in recurring revenue in April last year</a>.)</p><p>Databricks also <a href="https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-neon-help-developers-deliver-ai-systems">acquired Neon, a serverless Postgres,</a> <a href="https://www.wsj.com/articles/databricks-to-buy-startup-neon-for-1-billion-fdded971">for around $1 billion</a>. Neon&#8217;s (correct) bet was that Postgres was never going to go away, and there was an opportunity to potentially extend it to AI applications with embedding and retrieval tools like pgvector. 
It also bet early that, though there would be a need for some vector format, <a href="https://www.supervised.news/p/lets-check-back-in-on-the-vector">the massive scale of independent vector databases</a> like <a href="https://www.lancedb.com/">LanceDB</a> wasn&#8217;t going to be necessary for most companies. Indeed, Databricks largely <a href="https://www.databricks.com/blog/databricks-neon">positioned the announcement as important for agentic workflows</a>.</p><p>The list goes on&#8212;both in the form of acquisitions, but also in plenty of conversations about acquisitions. </p><ul><li><p>Fivetran, the largest provider for ETL tools, <a href="https://www.fivetran.com/press/fivetran-signs-agreement-to-acquire-census-delivering-the-first-end-to-end-data-movement-platform-for-the-ai-era">acquired Sequoia-backed reverse ETL startup Census</a>. The reverse ETL category contained not one, but <em>three</em> companies that touched it in some form or another: Census, <a href="https://hightouch.com/">Hightouch</a>, and <a href="https://www.rudderstack.com/">Rudderstack</a>.</p></li><li><p>OpenAI <a href="https://www.bloomberg.com/news/articles/2025-05-06/openai-reaches-agreement-to-buy-startup-windsurf-for-3-billion?embedded-checkout=true">acquired Cursor competitor Windsurf for $3 billion</a>, bringing a proper IDE into the fold rather than plugging a key into Cursor (or having extensive back-and-forths with Claude Sonnet). You could call this an app on top of AI, but really, this is potentially so deeply integrated into workflows it&#8217;s simply part of the stack.</p></li><li><p>Observability colossus Datadog <a href="https://www.geteppo.com/blog/eppo-is-now-part-of-datadog">acquired feature flagging and experimentation startup Eppo</a>, which Alex Konrad over at his new publication Upstarts Media <a href="https://www.upstartsmedia.com/p/datadog-buys-startup-220-million">reported was for $220 million</a>. 
This is after Datadog first whiffed on its <a href="https://www.supervised.news/p/a-rare-opening-against-datadog">AI model evaluation last year while a number of startups were gaining momentum</a>, and maybe signals an openness to buy into areas rather than build.</p></li><li><p>MongoDB <a href="https://www.mongodb.com/blog/post/redefining-database-ai-why-mongodb-acquired-voyage-ai">acquired Voyage AI</a>, a specialist in embedding products, <a href="https://www.bloomberg.com/news/articles/2025-02-24/mongodb-buys-voyage-ai-for-220-million-to-bolster-ai-search">for $220 million</a> in February. This <em>also</em> came after a lot of hand-wringing last year over whether MongoDB would launch its own embedding model after playing the field for so long.</p></li><li><p>The Information reported that <a href="https://www.theinformation.com/articles/snowflake-in-talks-to-acquire-analytics-startup-redpanda?rc=u5devd">Snowflake held talks to acquire Redpanda</a>, a streaming data service. This is <em>also</em> part of a potential shift (and ongoing joke) about all data coming in through streaming data pipelines, rather than batch data processing. (I&#8217;ll let you all argue amongst yourselves the difference between streaming and micro-batch.)</p></li></ul><p>The very crowded space that was, by some (most?) accounts, drastically over-funded has begun an early thinning out process through acquisitions. And this in some ways is in service to two impending changes that are intertwined: data lakes en route to being an increasingly attractive option for inexpensive storage; and the fight over owning the data control plane for a company&#8217;s data. 
(For the former, look no further than the battle for Tabular, which Databricks snatched from Snowflake by more than <a href="https://www.supervised.news/p/lakes-catalogs-and-snowflakes-full">doubling an early offer around $600 million</a> that I first reported.)</p><p>Late last year (before I had to begin my rehab attempt), a number of categories entered the conversation for ownership of a more complete data control plane. Dbt largely owned the data transformation layer. Atlan, Alation, and Collibra were dueling for ownership of the data catalog covering lineage and governance (<a href="https://www.supervised.news/p/lakes-catalogs-and-snowflakes-full">a space where Snowflake released its own open source product</a>). Fivetran owned classic ETL and (still) stands to grow into AI workflows&#8212;especially as lakes become a preferred option. The customer data pipeline was (and seemingly still is) up for grabs, with Hightouch&#8212;<a href="https://hightouch.com/blog/hightouch-funding-series-c">recently valued at $1.2 billion</a> in a round that included Bain Capital Ventures, Y Combinator, and Amplify Partners&#8212;and Rudderstack still going after it.</p><p>(One outcome I&#8217;m keeping my eye on right now is <a href="https://www.supervised.news/p/a-new-whisper-in-the-ai-analytics">what happens with Informatica</a> and whether it is able to modernize in a way that truly lets it make a run for ownership of the data control plane. As one source once joked to me, &#8220;I don&#8217;t want to have to deal with five logins.&#8221;)</p><p>The rise of Snowflake and Databricks gave birth to all of these micro-categories, which are seemingly consolidating into a smaller number of broader categories that make sense with expanded ownership of a general data pipeline. 
And while it doesn&#8217;t seem like there will be a <em>collapse</em>, there are a number of clear leaders emerging&#8212;all with slight overlap with one another, but largely playing nicely.</p><p>(To toot my own horn here, the majority of these companies were on my 2023 startup <a href="https://www.supervised.news/p/29-startups-in-ai-im-watching-closely">watch lists for AI</a> and <a href="https://www.supervised.news/p/27-big-data-startups-im-watching">Big Data</a>! I&#8217;ll revisit that in the next issue and provide a few updates.)</p><p>The difference now is that all of these companies have since adjusted their branding-slash-products to fit the mold of modern AI. And this whole process of the modern data stack extending itself into AI has been a <a href="https://www.supervised.news/p/december-in-ai-whats-old-is-new-and">constant theme since the launch of ChatGPT</a>. Enterprises have stricter requirements like robust data lineage and governance, which also feed directly into systems like those using RAG. And good AI is predicated on good data, after all, whether that&#8217;s flowing into model customization or data retrieval.</p><p>As a final note, one thread I am following <em>very</em> closely is the second coming of query engines like <a href="https://www.starburst.io/">Starburst</a>, another startup that was on my early watch list. Starburst&#8212;which counts Andreessen Horowitz, Altimeter Capital, Coatue, and Index as investors&#8212;raised an enormous amount of funding <a href="https://www.starburst.io/blog/starburst-announces-250m-series-d/">that valued it at $3.35 billion in the mania</a> around the modern data stack. But the shift in focus toward lakes with query engines on top of them has the potential to <em>enormously</em> benefit Starburst.</p><p>I suspect we are nowhere close to the end of this thinning out, which has been a very <em>long</em> time coming. 
The bigger question I have is whether this thinning out is going to lead to significant outcomes for firms and companies, or whether we&#8217;re talking about dollar-in-dollar-out acquisitions.</p><p><em>If you did indeed make it all the way down here, I want to say it one more time: thank you, from the bottom of my heart, for being a supporter and reader of my work as I went on this insane adventure. I&#8217;m forever in your debt.</em></p><p><em>As always, if you have any complaints, ideas, or thoughts you&#8217;d like to share, feel free to reach out to me. You can find me at m@supervised.news or on Signal at (415) 690-7086.</em></p>
Maybe this time it'll work?]]></description><link>https://www.supervised.news/p/please-send-openai-your-good-vibes</link><guid isPermaLink="false">https://www.supervised.news/p/please-send-openai-your-good-vibes</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Tue, 01 Oct 2024 20:26:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5-Vs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5-Vs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5-Vs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png 424w, https://substackcdn.com/image/fetch/$s_!5-Vs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png 848w, https://substackcdn.com/image/fetch/$s_!5-Vs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png 1272w, https://substackcdn.com/image/fetch/$s_!5-Vs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5-Vs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png" width="1456" height="1306" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1306,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3551465,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5-Vs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png 424w, https://substackcdn.com/image/fetch/$s_!5-Vs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png 848w, https://substackcdn.com/image/fetch/$s_!5-Vs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png 1272w, https://substackcdn.com/image/fetch/$s_!5-Vs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b0f680b-d4f3-4ce8-bca4-e10ffcb62a80_1868x1676.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><strong>a sleek friendly pixar-styled white robot trying to shut the door to a closet that is filled with hundreds of millions of documents, the documents are starting to fall out of the room as the robot shoves the door closed, sunday funny comics aesthetic &#8212;midjourney</strong></figcaption></figure></div><p><em>Around the time of this post I suffered a severe lower back injury that has pretty much prevented me from working. Paid subscribers received a 1 month extension and billing was fully paused beginning on October 15. In keeping with my policy, subscribers are not paying for what they are not getting. I&#8217;ll provide a further update in a few weeks once I&#8217;ve had an opportunity to treat the injury. 
My deepest, deepest thanks for all of your patience.</em></p><div><hr></div><p>While OpenAI&#8217;s advanced o1 &#8220;stop and think for a second&#8221; models may be a technical feat, the company is also increasingly working to move developers onto its significantly cheaper workhorse models.</p><p>As part of this, OpenAI has to wade into one of modern AI&#8217;s comically difficult problems: <a href="https://www.supervised.news/p/a-rare-opening-against-datadog">evaluating whether the response of a model is &#8220;good&#8221; in some quantitative and automated way</a>. The industry has <a href="https://www.supervised.news/p/vibes-based-evals-and-weights-and">long tried to graduate out of vibes-based evaluation</a>, but as of yet there&#8217;s no immediately reasonable solution. And yet, here we are, with OpenAI trying to do that as <a href="https://openai.com/index/api-model-distillation/">part of its new distillation tool</a>.</p><p>The distillation tool is one of a series of new developer tools OpenAI unveiled at its DevDay. The first, which has a much bigger <em>wow</em> factor, is the availability of its real-time advanced voice assistant as <a href="https://openai.com/index/introducing-the-realtime-api/">part of a broader &#8220;real-time&#8221; API</a>. But there are also additional features that focus on mitigating usage costs and increasing performance without OpenAI having to drop the cost of its APIs.</p><p>The distillation tool effectively uses larger models to fine-tune a smaller one (like GPT-4o mini). But OpenAI is trying to wrap it into a kind of &#8220;productized&#8221; package that streamlines the process. 
Developers can use a stored completions feature for exactly what it sounds like, and manage the evaluation and distillation directly through a user interface rather than integrating some complex suite.</p><p>OpenAI is also <a href="https://openai.com/index/api-prompt-caching/">releasing a prompt caching tool</a> that cuts costs for repeated content within a context window. Added together with the distillation suite, the updates are a roundabout way to shave off what developers have to pay to deliver a high-performance AI-powered product.</p><p>OpenAI has long run a playbook of being extremely convenient from both a usage and a procurement standpoint, even if you incur a cost penalty by ignoring the alternatives. Enterprises, though, <a href="https://www.supervised.news/p/putting-the-brakes-on-the-ai-hype">have become much more concerned with costs</a> (OpenAI&#8217;s in particular). As a result, OpenAI&#8217;s convenience advantage has started to become a bit more brittle in the face of a large list of alternative and highly competitive products.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E-XM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82835c0b-a99c-4df9-ae42-4e819969e871_1468x1020.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E-XM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82835c0b-a99c-4df9-ae42-4e819969e871_1468x1020.png 424w, https://substackcdn.com/image/fetch/$s_!E-XM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82835c0b-a99c-4df9-ae42-4e819969e871_1468x1020.png 848w, 
https://substackcdn.com/image/fetch/$s_!E-XM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82835c0b-a99c-4df9-ae42-4e819969e871_1468x1020.png 1272w, https://substackcdn.com/image/fetch/$s_!E-XM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82835c0b-a99c-4df9-ae42-4e819969e871_1468x1020.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E-XM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82835c0b-a99c-4df9-ae42-4e819969e871_1468x1020.png" width="1456" height="1012" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82835c0b-a99c-4df9-ae42-4e819969e871_1468x1020.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1012,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:173969,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E-XM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82835c0b-a99c-4df9-ae42-4e819969e871_1468x1020.png 424w, https://substackcdn.com/image/fetch/$s_!E-XM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82835c0b-a99c-4df9-ae42-4e819969e871_1468x1020.png 848w, 
https://substackcdn.com/image/fetch/$s_!E-XM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82835c0b-a99c-4df9-ae42-4e819969e871_1468x1020.png 1272w, https://substackcdn.com/image/fetch/$s_!E-XM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82835c0b-a99c-4df9-ae42-4e819969e871_1468x1020.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So the next step is clearly to find some way to add different layers of efficiency to ease the cost burden on developers and 
enterprises. It also has to find a way to reduce the difficulty of some of the more advanced techniques for getting costs under control. That also meant building a product that wades into everyone&#8217;s favorite <s>existential crisis</s> development challenge in the form of model evaluations.</p><p>The realtime API is <em>cool as hell</em> and is going to be useful for a lot of potential product cases. But, setting that aside for now, let&#8217;s get to the more in-the-weeds bits: how OpenAI is going to shave off costs at a time when it&#8217;s under pressure from all different directions with competing (and cheaper) products.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.supervised.news/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.supervised.news/subscribe?"><span>Subscribe now</span></a></p><h2>Getting into the mess that is model evals</h2><p>There are two direct routes that OpenAI is now deploying to find an optimal point between performance and value for both developers <em>and</em> for OpenAI. One is to collect as many possible efficiencies as you can when a model is actually in use. The other is to funnel as many people as you can to a workhorse model&#8212;like Gemini Flash or GPT-4o-mini.</p><p>Its prompt caching tool, where it&#8217;ll offer a 50% discount and faster prompt processing times on input tokens that it&#8217;s already seen recently, is playing a little bit of catch-up. This is becoming a pretty popular feature for multi-turn products that have a lot of information stuffed into a context window, such as large documents, code bases, long customer service interactions, or tools that require more advanced retrieval.</p><p>This feature was already available in Claude and Gemini, though they came in different flavors. 
Anthropic <a href="https://www.anthropic.com/news/prompt-caching">charges slightly more to write to a cache, but input tokens from that cache are 10% the cost of normal input tokens</a>. Anthropic&#8217;s prompt cache expires after 5 minutes, and that clock restarts every time the cache is used. Gemini&#8217;s context caching, meanwhile, also offers a substantial discount to normal input prices and <a href="https://ai.google.dev/gemini-api/docs/caching?lang=python">charges an hourly rate for every million tokens cached</a> (with no fixed expiration for the cache).</p><p>OpenAI&#8212;in its bid to try to simplify the experience as much as possible&#8212;basically sweeps all the complexity under the rug and applies the discount and performance bump right away while the cache window is active. As of today, it&#8217;s automatically applied to its latest models, without developers having to do much of anything.</p><p>OpenAI&#8217;s distillation &#8220;suite&#8221; includes enabling developers to store completions in order to generate a large dataset for fine-tuning its models. It also includes that evaluation tool, in which users can use a bunch of off-the-shelf evaluations (like groundedness) or design a custom evaluation. 
In that way, enterprises can pull in some of their custom evaluations and link them together directly within a single OpenAI interface right on top of completions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9uyB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b440d34-1e1f-49df-bc2c-ea73a8877849_2162x1206.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9uyB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b440d34-1e1f-49df-bc2c-ea73a8877849_2162x1206.png 424w, https://substackcdn.com/image/fetch/$s_!9uyB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b440d34-1e1f-49df-bc2c-ea73a8877849_2162x1206.png 848w, https://substackcdn.com/image/fetch/$s_!9uyB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b440d34-1e1f-49df-bc2c-ea73a8877849_2162x1206.png 1272w, https://substackcdn.com/image/fetch/$s_!9uyB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b440d34-1e1f-49df-bc2c-ea73a8877849_2162x1206.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9uyB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b440d34-1e1f-49df-bc2c-ea73a8877849_2162x1206.png" width="1456" height="812" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b440d34-1e1f-49df-bc2c-ea73a8877849_2162x1206.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2473437,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9uyB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b440d34-1e1f-49df-bc2c-ea73a8877849_2162x1206.png 424w, https://substackcdn.com/image/fetch/$s_!9uyB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b440d34-1e1f-49df-bc2c-ea73a8877849_2162x1206.png 848w, https://substackcdn.com/image/fetch/$s_!9uyB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b440d34-1e1f-49df-bc2c-ea73a8877849_2162x1206.png 1272w, https://substackcdn.com/image/fetch/$s_!9uyB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b440d34-1e1f-49df-bc2c-ea73a8877849_2162x1206.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>
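<p>To make the caching math above concrete, here&#8217;s a rough sketch of how the different discount structures play out for a prompt that keeps getting re-sent. The base rate and cache-hit fraction below are illustrative assumptions, not published prices, and Anthropic&#8217;s one-time cache-write premium is deliberately ignored to keep the comparison simple.</p>

```python
def effective_input_cost(tokens, base_rate, hit_fraction, cached_discount):
    """Cost of one request's input tokens when `hit_fraction` of them are
    served from the prompt cache at `cached_discount` times the base rate."""
    fresh = tokens * (1 - hit_fraction) * base_rate
    cached = tokens * hit_fraction * base_rate * cached_discount
    return fresh + cached

# OpenAI's scheme: cached input tokens billed at 50% of the normal rate,
# applied automatically. Anthropic's: cache reads at 10% of the normal
# input price (cache-write premium omitted here).
tokens = 100_000          # e.g. a long document re-sent on every turn
base = 1.0                # normalized price per input token
no_cache = effective_input_cost(tokens, base, 0.0, 0.5)
openai_style = effective_input_cost(tokens, base, 0.9, 0.5)
anthropic_style = effective_input_cost(tokens, base, 0.9, 0.1)
```

<p>The point of the sketch: the steeper the cached-token discount, the more the economics reward stuffing stable context into the cacheable prefix, which is exactly the multi-turn, big-context workload described above.</p>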
      <p>
          <a href="https://www.supervised.news/p/please-send-openai-your-good-vibes">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[A rare opening against Datadog]]></title><description><![CDATA[Plus: OpenAI's portfolio is getting a little big.]]></description><link>https://www.supervised.news/p/a-rare-opening-against-datadog</link><guid isPermaLink="false">https://www.supervised.news/p/a-rare-opening-against-datadog</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Fri, 27 Sep 2024 18:47:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PG9N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PG9N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PG9N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png 424w, https://substackcdn.com/image/fetch/$s_!PG9N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png 848w, https://substackcdn.com/image/fetch/$s_!PG9N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png 1272w, 
https://substackcdn.com/image/fetch/$s_!PG9N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PG9N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png" width="1456" height="1101" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1101,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5576422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PG9N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png 424w, https://substackcdn.com/image/fetch/$s_!PG9N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png 848w, https://substackcdn.com/image/fetch/$s_!PG9N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png 1272w, 
https://substackcdn.com/image/fetch/$s_!PG9N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5897d74a-2d9a-4bf9-9e83-1f7171b36d96_2412x1824.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">a purple puppy in a medical lab coat sitting at a desk writing on a piece of paper, the paper is covered with math problems, and there are similar papers taped on the wall in the office, sunday funny comics aesthetic, --ar 4:3 &#8212; midjourney</figcaption></figure></div><p><em>This issue was finished on a mobile device so I apologize 
ahead of time for any typos and formatting errors.</em></p><div><hr></div><h2>OpenAI&#8217;s sprawling portfolio problem</h2><p>While OpenAI&#8217;s latest model, o1, is clearly a massive improvement in performance, it&#8217;s also creating a potential new challenge: product sprawl.</p><p>Its latest products, o1 and o1-mini, essentially give users and customers a tradeoff: you can wait longer and, for now, spend more, but the output should be much better. It&#8217;s not the kind of API you&#8217;d plug into a call center, but it does fill kind of a new niche for a company that already covers a whole range of niches beyond just a &#8220;one thing that does almost everything well.&#8221;</p><p>OpenAI&#8217;s o1 use case is basically the &#8220;you have to think about it for a second&#8221; problem. Box CEO <a href="https://x.com/levie/status/1835909498000810239">Aaron Levie gives an extremely good enterprise example</a> of having to find a very specific parameter of a contract&#8212;in this case, the date of the <em>final signature</em> on a contract, and thus the date it effectively went live. This is a really crisp &#8220;you had to think about it for a second&#8221; problem: there was an added layer of complexity to parse in an otherwise straightforward question, in this case the possibility that people signed the contract on different days.</p><p>That&#8217;s also a kind of <em>newish</em> use case for OpenAI, where historically solving it might have taken a long set of API calls and begging the models to do what they need to do in the form of prompt tuning or retrieval augmented generation (or RAG). Instead, that whole process could&#8212;hypothetically&#8212;get compressed into a single call or two and simplify some architectures. </p><p>But at this point OpenAI has been <a href="https://www.supervised.news/p/openais-silicon-valley-pivot">undergoing a kind of &#8220;SaaS-ification&#8221;</a> where it matures into a real business. 
A number of executives and co-founders have continued to make their way out the door, like co-founder and chief scientist Ilya Sutskever leaving to found <a href="https://www.reuters.com/technology/artificial-intelligence/openai-co-founder-sutskevers-new-safety-focused-ai-startup-ssi-raises-1-billion-2024-09-04/">a new safety-focused AI startup that raised $1 billion</a> and <a href="https://x.com/johnschulman2/status/1820610863499509855?lang=en">co-founder John Schulman leaving to join Anthropic</a>. As of this week, that list also <a href="https://www.supervised.news/p/metas-llamas-go-local">includes CTO Mira Murati</a>. </p><p>As OpenAI matures into a real business, it&#8217;s running head-first into the challenge of avoiding product creep at a time when its product portfolio (or in this case, model portfolio) is starting to get more and more complex. Larger product portfolios are inherently more difficult to manage, and the biggest challenge companies at this phase face is keeping the portfolio from getting unwieldy and making its value difficult to communicate to customers. 
And with the quality of models constantly improving, any small number of distractions could open the door for one of the other frontier model companies to usurp it.</p><p>So, just quickly, let&#8217;s recap OpenAI&#8217;s now very large portfolio of APIs:</p><ul><li><p><strong>GPT-4o</strong>: a multi-modal, expensive (though not immune to price cuts) model that&#8217;s supposed to be Good and General Purpose.</p></li><li><p><strong>GPT-4o mini</strong>: a less powerful version of GPT-4o that&#8217;s designed to be a successor to its workhorse GPT-3.5 Turbo model to satisfy a large number of simpler use cases.</p></li><li><p><strong>Fine-tuned versions</strong> of the above to serve enterprise-specific needs, though they are fed through an API which might turn off some more security-conscious companies.</p></li><li><p><strong>Batch versions</strong> of the above with a 24-hour completion window, for a 50% price discount.</p></li><li><p><strong>o1</strong>: a model that trades speed and price for quality by allowing it additional time to &#8220;reason&#8221; about an answer. Basically, that &#8220;stop and think about it&#8221; question.</p></li><li><p><strong>o1-mini</strong>: like 4o mini, a smaller version of o1 designed for&#8230; the same &#8220;stop and think about it&#8221; question. But we&#8217;ll just run under the assumption there&#8217;s a set of problems this is really good for.</p></li><li><p><strong>Whisper</strong>: arguably the best speech-to-text model on the market that was most assuredly built for generating training data that OpenAI needs.</p></li><li><p><strong>Text embeddings</strong>: the not-quite-the-best embeddings product whose advantage generally seems to be that it&#8217;s provisioned with other OpenAI products, which reduces procurement&#8217;s headaches.</p></li><li><p><strong>Text-to-speech</strong>: an API that you could, hypothetically, slot into something like a call center assuming the latency works. 
It has both a normal and HD version of the API.</p></li><li><p><strong>Advanced Voice assistant</strong>: a technical marvel of a product that lets you have an active conversation within ChatGPT where the obvious killer &#8220;use case&#8221; still isn&#8217;t super clear.</p></li><li><p><strong>ChatGPT</strong>: OpenAI&#8217;s &#8220;productized&#8221; version of all of the above in one frontend enterprise-friendly wrapper.</p></li></ul><p>Amid all of this, Meta <a href="https://www.supervised.news/p/metas-llamas-go-local">has added to the chaos by releasing updated models in almost every single one of those categories</a>&#8212;except they&#8217;re open-ish source and power a very different array of products even though they exist in the same &#8220;bucket.&#8221; With o1, OpenAI is essentially taking another shot at category creation, but also risking a ballooning portfolio.</p><p>Here&#8217;s how all the pricing ends up playing out, including a handful of competitors as a point of reference:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vYB7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70005815-d87c-41ba-8b08-30992c5aa8c0_1814x1268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vYB7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70005815-d87c-41ba-8b08-30992c5aa8c0_1814x1268.png 424w, https://substackcdn.com/image/fetch/$s_!vYB7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70005815-d87c-41ba-8b08-30992c5aa8c0_1814x1268.png 848w, 
https://substackcdn.com/image/fetch/$s_!vYB7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70005815-d87c-41ba-8b08-30992c5aa8c0_1814x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!vYB7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70005815-d87c-41ba-8b08-30992c5aa8c0_1814x1268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vYB7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70005815-d87c-41ba-8b08-30992c5aa8c0_1814x1268.png" width="1456" height="1018" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70005815-d87c-41ba-8b08-30992c5aa8c0_1814x1268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1018,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224009,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vYB7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70005815-d87c-41ba-8b08-30992c5aa8c0_1814x1268.png 424w, https://substackcdn.com/image/fetch/$s_!vYB7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70005815-d87c-41ba-8b08-30992c5aa8c0_1814x1268.png 848w, 
https://substackcdn.com/image/fetch/$s_!vYB7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70005815-d87c-41ba-8b08-30992c5aa8c0_1814x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!vYB7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70005815-d87c-41ba-8b08-30992c5aa8c0_1814x1268.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>And these costs are probably nowhere close to where things will land after o1 has been out for a while. 
<a href="https://x.com/OpenAI/status/1835857163765637607">OpenAI has already started incrementally increasing rate limits</a>, with o1-preview recently going up from 30 per week to 50 and o1-mini going from 50 per week to 50 per day.</p><p>While OpenAI&#8217;s appeal has always been in some sweet spot between convenience, price, and performance, its ballooning portfolio certainly poses as much of a challenge as an opportunity. I seriously doubt that the prices will remain this high, as its next GPT model will be ready <em>at some point</em>. But for now it at least gets something into the hands of developers that has an aggressive price tag at a time when it&#8217;s trying to raise a colossal round of funding.</p><p>The challenge here, though, is the same one that any maturing company starts to face over time: product creep. While OpenAI technically has &#8220;two&#8221; products in the form of its APIs and ChatGPT, those products all have a ton of branches that serve a very wide number of use cases. The APIs also go well beyond just chat completion and text generation, and include a whole variety of modalities. And its voice product is probably the most awkward part of its portfolio, seemingly more of a &#8220;wow&#8221; feature for ChatGPT than anything else.</p><p>And product creep is a Known Problem in startup-land, if you can even call OpenAI a startup any more. As a startup matures, and its user and customer base grows, it has to directionally develop&#8212;anticipating the handful of use cases that satisfy the most customers without building everything for everyone. Or, more succinctly, do a few things but do them well.</p><p>Offering this wide array of use cases gives OpenAI the ability to funnel users to some kind of steady state where it isn&#8217;t necessarily making money, but at least it isn&#8217;t <em>losing</em> money. 
On the API front, that has traditionally meant pushing users to its workhorse models (particularly fine-tuned versions), but that&#8217;s a little less clear with ChatGPT.</p><h2>ChatGPT and finding efficiencies in inference</h2><p>While the new models are rate-limited in ChatGPT, the app is also an extremely important part of OpenAI&#8217;s business beyond the APIs. OpenAI COO Brad Lightcap told staff that OpenAI has more than 10 million paying subscribers for ChatGPT and an additional 1 million subscribers for businesses, <a href="https://www.theinformation.com/articles/openai-coo-says-chatgpt-passed-11-million-paying-subscribers?rc=u5devd">according to The Information</a>. (Bloomberg <a href="https://www.bloomberg.com/news/articles/2024-09-05/openai-hits-1-million-paid-users-for-business-version-of-chatgpt?embedded-checkout=true">earlier reported that OpenAI has 1 million paid business users for ChatGPT</a>.)</p><p>We can of course do a lot of napkin math, but we would arrive at the same conclusion: ChatGPT&#8217;s enterprise business is, and will be, a massive part of OpenAI&#8217;s business beyond the APIs.</p><p>But it&#8217;s not like the revenue from ChatGPT will just scale up with usage of o1 directly like it would with the API. The cost consideration for actually <em>running</em> it, either via API or within ChatGPT, is probably going to change as pre-training compute resources shift toward inference.</p><p>Fortunately for OpenAI, this is an area where there&#8217;s already a lot happening. One approach that&#8217;s getting a <em>ton</em> of extra attention right now among those in the industry I talk to is Monte Carlo Tree Search, a way to narrow the amount of compute needed to generate a high-quality result. 
And distillation, another way of &#8220;shrinking&#8221; the larger models, is also gaining a lot of momentum as interest shifts to managing inference costs of high-performance models.</p><p>&#8220;That&#8217;s the sweet spot, blending strategies from traditional predictive machine learning with years of work that went in there with modern techniques,&#8221; Sri Ambati, co-founder and CEO of model developer and platform <a href="https://h2o.ai/">h2o.ai</a>, told me. &#8220;Tree search is an absolutely genius trick, it&#8217;s very easy low-hanging fruit that&#8217;s combined with the brilliance of LLMs.&#8221;</p><p>The other bit that comes up most often in conversations with experts and sources is <a href="https://noambrown.github.io/">Noam Brown&#8217;s</a> work at OpenAI. Brown is widely considered a premier expert in game theory, and many have wondered how his work would be applied to OpenAI&#8217;s products. Brown was also a <a href="https://arxiv.org/pdf/2112.07544">co-author of a paper</a> that in part examined applications of Monte Carlo Tree Search in developing human-like agents.</p><p>The challenge OpenAI has for its APIs is that, based on conversations with most enterprises and platforms I talk to lately, costs are either number one or two on the list of considerations for building an AI app for production use cases. But OpenAI has certainly shown a willingness to drop its prices over time to remain competitive.</p><p>And it&#8217;s also telling that OpenAI&#8217;s ChatGPT enterprise business seems to be the significant driver for its business, making all this development essentially in service of that enterprise kit. 
The alternative&#8212;building something bespoke internally using really cheap stuff off the shelf&#8212;is increasingly compelling for companies with far more advanced governance and cost requirements.</p><p>The cynical take here is that OpenAI is trying to say something along the lines of &#8220;hey, see, we&#8217;re still building extremely advanced stuff, don&#8217;t ignore our fundraising calls.&#8221;</p><p>The flip side of that argument, though, is that OpenAI&#8212;with what appears to be a <a href="https://www.cnbc.com/2024/09/26/openais-cfo-says-funding-round-should-close-by-next-week-in-letter.html">massive funding round that&#8217;s oversubscribed, according to CNBC</a>&#8212;has already matured into a company better suited to enterprise products. And now OpenAI just has to make sure it isn&#8217;t confusing enterprises on sales calls with a colossal portfolio of models.</p><div><hr></div><h2>An opening in Datadog&#8217;s armor</h2><p>Datadog has enjoyed an incredibly enviable position of power and a relatively pristine track record of launches that makes more than a few adjacent companies nervous every time something new comes out.</p><p>Startups, investors, and developers in the rapidly burgeoning AI space had long expected Datadog to storm into building tools for evaluating the performance and success of LLMs this year. It was already a considerably difficult space, one trying to move beyond a &#8220;vibes&#8221;-based approach, but if there were a large company best positioned to do it, it would likely have been Datadog.</p><p>But Datadog&#8217;s annual <a href="https://www.datadoghq.com/blog/dash-2024-new-feature-roundup-keynote/">conference came and went in June</a>, and what those startups (and their investors) got instead was a bit of a sigh of relief.</p>
      <p>
          <a href="https://www.supervised.news/p/a-rare-opening-against-datadog">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Meta's Llamas go local]]></title><description><![CDATA[Plus: Notion, one of AI's brightest stars, rolls out a fuller-fledged AI companion, and several additional executives leave OpenAI.]]></description><link>https://www.supervised.news/p/metas-llamas-go-local</link><guid isPermaLink="false">https://www.supervised.news/p/metas-llamas-go-local</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Thu, 26 Sep 2024 17:48:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BJod!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BJod!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BJod!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png 424w, https://substackcdn.com/image/fetch/$s_!BJod!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png 848w, https://substackcdn.com/image/fetch/$s_!BJod!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BJod!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BJod!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png" width="1456" height="1099" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1099,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5289399,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BJod!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png 424w, https://substackcdn.com/image/fetch/$s_!BJod!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png 848w, https://substackcdn.com/image/fetch/$s_!BJod!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BJod!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef697408-8162-4d0e-8683-5c5fce0a817f_2440x1842.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">an absolutely tiny llama, the size of a fly, sits on top of a home office desk, the desk is 100x the size of the bug-sized llama, sunday funny comics aesthetic, --ar 4:3 &#8212; midjourney</figcaption></figure></div><p><em>Existing paid subscribers received a one week extension to account for a missed issue last week while I&#8217;m 
working through another lower back injury. Thank you everyone for your patience!</em></p><div><hr></div><h2>Meta&#8217;s model paths come together </h2><p>Meta&#8217;s Llama portfolio continues to grow in size, complexity, and nomenclature&#8212;but this time, it&#8217;s going after what&#8217;s quickly becoming one of the most important potential markets in AI: networks of task-specific smaller models.</p><p>While Meta has released larger models in a way that sows a little chaos at the top end of the quality spectrum with OpenAI and Anthropic, it seems to be setting its sights on sowing <em>more</em> chaos at the lower end of the spectrum, where Google&#8217;s Gemma 2 and Microsoft&#8217;s Phi models have become popular. Meta&#8217;s new tiny models, which it <a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/">says in a blog post</a> are designed for edge use cases, fit neatly into one of roughly four &#8220;buckets.&#8221; And Meta is now competing with many of its &#8220;rivals&#8221; in all four tiers.</p><ul><li><p><strong>&#8220;Tiny&#8221; ones that are suited for trivial and straightforward tasks with very little compute</strong>: Gemma 2, Phi-3, Llama 3.2 1B and 3B, and various other open source models. These models are well-suited for cases where you have access to little memory or compute, such as the edge devices Meta suggests.</p></li><li><p><strong>Smaller, more general purpose &#8220;workhorse&#8221; models designed to handle mostly trivial tasks that are still too complex for the &#8220;tiny&#8221; models</strong>: Gemma 2 9B, Llama 3.1 8B, Llama 3.2 11B, Mistral NeMo/Pixtral, and various other open source models. In these cases you&#8217;re less memory-constrained, but cost starts to become a consideration, making these smaller open-ish source models more attractive than APIs like GPT-4o-mini. 
These could involve summarizing a high volume of inbound documents, like sales calls or customer support tickets.</p></li><li><p><strong>The not-quite-as-good-as-GPT-4-but-still-pretty-good category where tasks start to become a little more ambiguous and open-ended</strong>: Meta&#8217;s 70B (and now 90B) models fit in here, and you could also throw in Cohere&#8217;s RAG-focused Command-R+ and the larger versions of the Gemma-series models. You could imagine more complex searches, like across legal documents, that would be well-suited to a GPT-4o, where you&#8217;re willing to take a small penalty on performance and ease of use&#8212;such as some increased complexity for fine-tuning&#8212;in exchange for lower costs.</p></li><li><p><strong>The substantially larger and more general purpose versions requiring high-quality responses</strong>: Llama 405B and the available foundation model APIs like Anthropic&#8217;s Sonnet, Gemini Pro, and GPT-4o. These are more suited to extremely high quality conversational experiences like the ones you might find in <a href="https://www.notion.so/product/ai">Notion</a> (more on that in a second).</p></li></ul><p>OpenAI is also seemingly trying to create a new category among those buckets with its slower-but-more-powerful o1 series models. In that case, OpenAI is once again flexing its research capabilities, even if that means trying to build out into a new niche. </p><p>But an enormous amount of the emphasis in this announcement seems to land on that &#8220;edge device&#8221; part, with Meta showing off a number of different pieces of hardware it will work on. While that might end up providing some useful experiences on devices, the results from Apple&#8217;s (beta) approach integrated into one of the most prevalent operating systems on the planet <a href="https://www.washingtonpost.com/technology/2024/09/09/iphone-16-apple-intelligence-ai-event-2024/">have been pretty mixed</a>. 
The upper bound of these tiny models on edge devices&#8212;even augmented with a set of task-specific adaptors&#8212;isn&#8217;t so clear.</p><p>But Meta will also be able to capitalize on the enormous enthusiasm around local device AI inference, which has shown a lot of opportunity around rapid prototyping and development. Meta here gets to once again pounce on a developer community that&#8217;s craving more options to play with, and then just grab the best ideas for its own edge devices like its <a href="https://www.theverge.com/24253908/meta-orion-ar-glasses-demo-mark-zuckerberg-interview">prototype Orion headset</a>. And that starts with finding the absolute bare-bones utility of those smaller models.</p><h2>Extending RPA to the edge</h2><p>While these smaller models are designed to work well on edge devices, with some customization on task-specific or company-specific data they excel at simple jobs like classification, entity extraction, or summarization. The emerging <a href="https://www.supervised.news/p/for-apple-its-rpa-all-the-way-down">basic use case for AI has been to construct</a> a more advanced version of robotic process automation (or RPA) through the use of smaller models chained together that each independently resolve these more trivial tasks.</p><p>That RPA-oriented &#8220;base case&#8221; has emerged as a pathway to quickly recognize an actual return on investment in AI tooling at a time when we&#8217;re all talking about <a href="https://www.supervised.news/p/about-that-ai-bubble">how we&#8217;re in an enormous AI bubble</a>. 
And a lot of this was <em>already</em> powered by Meta&#8217;s existing recent smaller Llama 3.1 model, though there has generally been some interest in experimenting with the even smaller ones amongst companies I talk to (particularly the &#8220;tiny&#8221; version of Phi 3).</p><p>Meta&#8217;s existing smaller model (8B) is getting an upgrade to being multimodal <em>and</em> slightly larger (11B), but still has a small enough footprint you could probably comfortably run a compact version of it on a Macbook Pro. A <a href="https://huggingface.co/blog/llama32">Hugging Face blog post indicates a compact version of its vision model takes up around 10GB of GPU RAM during inference</a>. And unsurprisingly, its &#8220;tiny&#8221; models <a href="https://x.com/ollama/status/1839007158865899651">are already available on Ollama</a>, a popular tool for running smaller models locally on devices like a Macbook Pro.</p><p>It&#8217;s also why we&#8217;ve started to see a kind of obsession over <a href="https://www.supervised.news/p/the-point-of-lightning-fast-model">extreme speed when generating results from smaller models</a>&#8212;like in the thousands of tokens per second. Rather than just looking for a single response, fast inference platforms enable companies to quickly chain together task-specific models into a kind of proto-agentic network to, at the end of the chain, accomplish some more difficult task.</p><p>While these kinds of &#8220;buckets&#8221; have been in the making for the past year or so, Meta&#8217;s more direct entry into each with the newest iteration of its Llama 3 models has formalized the boundaries. Rather than constructing the categories based on model size&#8212;which can vary while being similar in performance&#8212;the divisions are based on task complexity. 
</p><p>Many enterprises I&#8217;ve spoken with that are interested in actually putting AI-based tooling into production have <a href="https://www.supervised.news/p/putting-the-brakes-on-the-ai-hype">shifted their thinking in how they plan to deploy AI</a>. Expectations have dropped <em>dramatically</em> from a time when executives thought they could throw everything into an OpenAI API and assume they&#8217;d completely replaced entire departments, but those lowered expectations have also exposed how practical some of these smaller language models are for actually useful tasks. And hosting and deploying these smaller and more customized models is significantly more attractive now than it was a year ago.</p><p>Part of that is because of the significant cost of relying on APIs like OpenAI&#8217;s and Anthropic&#8217;s for those more powerful frontier models&#8212;even when it comes to their smaller counterparts like GPT-4o-mini. While the price of those API-backed models continues to go down (and is, to be clear, a race to the bottom), it&#8217;s still considerably cheaper to host your own custom version of a smaller language model. And in some ways, those smaller models can even be a bit better because they&#8217;re more tailored to specific use cases.</p><p>Going <em>even smaller</em> than the &#8220;practical&#8221; small models exposes a lot of potential new use cases when it comes to edge devices&#8212;which could be phones, but could also be sensors or an array of Raspberry Pis. But when we&#8217;re talking about &#8220;edge&#8221; use cases, one in particular has lately gotten a lot of interest from executives and sources, even if it&#8217;s still just a curiosity: in-browser language model experiences. </p><p><a href="https://www.supervised.news/p/openais-silicon-valley-pivot">In this case, using WebGPU, the &#8220;edge&#8221; device is actually a browser</a>. 
One example I&#8217;ve seen thrown around before is an in-browser SQL-generating assistant running in tandem with DuckDB, creating a whole new kind of local analytics experience. (There are some technical concerns that come up here around threads and available memory, but again, it&#8217;s an example.)</p><p>Meta, in the meantime, is also benefitting from jumping on the very hyped and exciting&#8212;and still relatively impractical in production&#8212;focus on local model development. <a href="https://github.com/ggerganov/llama.cpp">Llama.cpp</a>, <a href="https://ollama.com/">Ollama</a>, and other tools enabling local model usage have sparked a whole wave of enthusiasm around rapid prototyping and development of potential AI-powered products at a larger scale.</p><p>(Meta, alas, did not give me a heads up on the news or invite me to any of these events. One day!)</p><div><hr></div><h2>Notion tries to do a bit of everything in AI</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_uCq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f8641a-9a3f-4a6f-949f-c0aa9946c201_1920x1080.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_uCq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f8641a-9a3f-4a6f-949f-c0aa9946c201_1920x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!_uCq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f8641a-9a3f-4a6f-949f-c0aa9946c201_1920x1080.gif 848w, 
https://substackcdn.com/image/fetch/$s_!_uCq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f8641a-9a3f-4a6f-949f-c0aa9946c201_1920x1080.gif 1272w, https://substackcdn.com/image/fetch/$s_!_uCq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f8641a-9a3f-4a6f-949f-c0aa9946c201_1920x1080.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_uCq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f8641a-9a3f-4a6f-949f-c0aa9946c201_1920x1080.gif" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2f8641a-9a3f-4a6f-949f-c0aa9946c201_1920x1080.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:453579,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_uCq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f8641a-9a3f-4a6f-949f-c0aa9946c201_1920x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!_uCq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f8641a-9a3f-4a6f-949f-c0aa9946c201_1920x1080.gif 848w, 
https://substackcdn.com/image/fetch/$s_!_uCq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f8641a-9a3f-4a6f-949f-c0aa9946c201_1920x1080.gif 1272w, https://substackcdn.com/image/fetch/$s_!_uCq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f8641a-9a3f-4a6f-949f-c0aa9946c201_1920x1080.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>On a rather hectic news day with the new Llama-series models and the <a 
href="https://www.cnbc.com/2024/09/25/openai-cto-mira-murati-announces-shes-leaving-the-company.html">abrupt departure of OpenAI CTO Mira Murati</a>, Notion also rolled out a <a href="https://www.notion.so/product/ai">significantly updated version of its integrated AI tools</a>.</p><p>Notion is considered, among most of the sources and industry experts I speak with, to be one of the most advanced companies when it comes to actually putting AI tools into production. There&#8217;s a handful of companies that routinely come up, but most typically think Notion&#8217;s approach and architecture is one of the most complex and effective ways to utilize everything that&#8217;s available.</p><p>And in that way, a lot of investors, developers, and other startups look at Notion as a kind of &#8220;bellwether&#8221; in AI development&#8212;letting the startup do all the experimentation and product development to see what actually is possible and sticks with customers.</p><p>Its latest update includes connectors to Slack and Google Drive, introducing a new set of unstructured data pipelines, as well as other updates like analyzing files and image attachments, style-specific document generation and editing, and direct selection of knowledge sources.</p><p>&#8220;I&#8217;d say we are small enough and nimble enough that we don&#8217;t have to go through a bunch of layers of approval to try something new out,&#8221; Shir Yehoshua, AI engineering lead at Notion, told me. &#8220;We have a very prototype-y culture. Our general model is to prototype it first and if it works&#8212;whether that&#8217;s an evaluation tool, or an actual LLM provider, a vector DB&#8212;as fast as we possibly can and get very empirical data on how well it works. And also try not to fall into a sunk cost fallacy.&#8221;</p>
      <p>
          <a href="https://www.supervised.news/p/metas-llamas-go-local">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Putting the brakes on the AI hype]]></title><description><![CDATA[Enterprises are taking a more methodical approach when figuring out how to put AI tools into production&#8212;and considering much longer timelines.]]></description><link>https://www.supervised.news/p/putting-the-brakes-on-the-ai-hype</link><guid isPermaLink="false">https://www.supervised.news/p/putting-the-brakes-on-the-ai-hype</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Tue, 10 Sep 2024 18:17:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GyLR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GyLR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GyLR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png 424w, https://substackcdn.com/image/fetch/$s_!GyLR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png 848w, https://substackcdn.com/image/fetch/$s_!GyLR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png 1272w, 
https://substackcdn.com/image/fetch/$s_!GyLR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GyLR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png" width="1456" height="1100" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1100,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5023211,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GyLR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png 424w, https://substackcdn.com/image/fetch/$s_!GyLR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png 848w, https://substackcdn.com/image/fetch/$s_!GyLR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png 1272w, 
https://substackcdn.com/image/fetch/$s_!GyLR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3427d7c8-e5ee-4ffc-8686-6f9c09ea7b8b_2432x1838.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">a small pixar-style robot in a business suit standing in front of a whiteboard, the whiteboard has a lot of math equations on it, sunday funny comics aesthetic, --ar 4:3 &#8212; midjourney</figcaption></figure></div><p>Over the past few weeks I&#8217;ve been asking dozens of industry executives and sources the same thing: how have 
the questions enterprises have been asking about AI tools changed in the last six to nine months as they get more serious about getting projects into production?</p><p>Almost universally the response has been that the tenor of these conversations has changed, especially when talking to enterprises and executives that initially bought into the original AI hype. Or, more specifically:</p><ul><li><p>Those enterprises and executives are still bought in on putting out AI products, and recognize an urgency to find uses for them to either remain competitive or to save on costs.</p></li><li><p>They&#8217;re much more educated and more deliberate about the way they are looking to implement these models, and <em>much</em> more discerning when it comes to the cost of operating those tools.</p></li></ul><p>In short, enterprises are getting, well, smarter about what they are looking for when they are evaluating AI-based tools. And it&#8217;s generally confirmed the kind of &#8220;vibe shift&#8221; that started to happen in the past six months. <a href="https://www.supervised.news/p/about-that-ai-bubble">Despite the recent fuss about whether or not we&#8217;re in an AI bubble</a>, the &#8220;hype cycle&#8221; actually ended quite a while ago as companies have started to figure out how these things are going to actually be useful <em>now</em>.</p><p><a href="https://www.supervised.news/p/ai-in-august-rbac-is-back-data-as">And we&#8217;ve largely started to see that emerge already</a> in more straightforward&#8212;and boring&#8212;tasks, even if companies aren&#8217;t very loud about it. 
The era of thinking you could wave an OpenAI-shaped wand and generate completely new lines of business or completely replace an entire class of workers is effectively gone, but at the same time, there&#8217;s enough signal that companies aren&#8217;t just throwing the idea of deploying a generative AI tool into the trash or writing it off as an experiment.</p><p>&#8220;It&#8217;s a lot less frenetic as compared to how it was,&#8221; Brian Raymond, CEO of <a href="https://unstructured.io/">unstructured data ETL provider Unstructured.io</a>, told me. &#8220;Things aren&#8217;t changing every hour or every week from an industry standpoint, we have a little more confidence on where things are going, and there are fewer surprises. We&#8217;re in this unsexy phase, but this is the phase in which most of the value is gonna get created among the organizations that are trying to leverage generative AI. It&#8217;s less rapid experimentation and mind-blown emojis, and it&#8217;s more like, let&#8217;s drive ROI.&#8221;</p><p>All this is pretty far from the days right after the launch of GPT-4, when certain executives at a company would barge into a room shouting &#8220;AI&#8221; and leave the team scrambling to build <em>something. </em></p><h2>The list of needs for enterprises keeps growing</h2><p><a href="https://www.supervised.news/p/about-that-ai-bubble">While we all talk about whether or not AI is in a bubble</a> for the Very Big Models, what&#8217;s become clearer is how enterprises are evaluating ways to generate near-term ROI through the use of language models. Most companies actively using these tools I&#8217;ve spoken with have done that through <a href="https://www.supervised.news/p/ai-in-april-and-q2-rpa-in-focus-holistic">automating smaller tasks, like summarization and classification, with customized smaller language models</a>. 
By focusing on these RPA-like tasks with smaller models, they&#8217;re able to save significantly on cost relative to paying for an API product like one from OpenAI or Google.</p><p>These kinds of tasks are augmented with the use of information retrieval techniques like retrieval augmented generation (or RAG),<a href="https://www.supervised.news/p/a-clever-way-to-sidestep-some-of"> which effectively fetches some extra information for a prompt in a given language model</a>. Companies convert their data&#8212;<a href="https://www.supervised.news/p/lakes-catalogs-and-snowflakes-full">which most likely sits in a provider like Snowflake, Databricks, or others</a>&#8212;into a format through a process called embedding. They then make that data readily available on-demand for these prompts. (Most <a href="https://www.supervised.news/p/a-new-whisper-in-the-ai-analytics">developers I&#8217;ve spoken with have also joked</a> that all of these &#8220;new&#8221; techniques that people keep &#8220;discovering,&#8221; <a href="https://www.supervised.news/p/where-graph-databases-live-in-a-future">like RAG augmented with a graph database</a>, are also decades old.)</p><p>The needs of those enterprises are also becoming a lot more sophisticated as time goes on. Rather than just firing stuff into a prompt, enterprises are increasingly looking for governance tooling around these language models, <a href="https://www.supervised.news/p/a-classic-data-problem-is-taking">such as lineage</a> and, seemingly more recently, <a href="https://www.supervised.news/p/ai-in-august-rbac-is-back-data-as">role-based access control (or RBAC</a>). Everything old, it seems, is new again&#8212;just this time in AI.</p>
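To make the retrieve-then-prompt loop concrete, here is a minimal sketch of the RAG pattern described above. It is a toy, not any vendor's API: the bag-of-words embedding stands in for a real embedding model, and the documents and function names are made up for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: bag-of-words term counts.
    # A real pipeline would call an embedding model and store the
    # resulting vectors in a vector database.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# "Embed" company documents ahead of time (this made-up corpus stands in
# for data sitting in a warehouse like Snowflake or Databricks).
docs = [
    "Q3 revenue grew 12 percent driven by enterprise contracts",
    "On-call rotation schedule for the platform team",
    "Expense policy: meals over 50 dollars require receipts",
]
index = [(doc, embed(doc)) for doc in docs]

def retrieve(question: str, k: int = 1) -> list:
    # Fetch the k chunks most similar to the question.
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# The retrieved context is what gets stuffed into the prompt on demand.
question = "How much did revenue grow last quarter?"
context = retrieve(question)[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
```

The point of the sketch is the shape of the flow: convert once, index, then fetch extra information per prompt rather than handing the model everything.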
      <p>
          <a href="https://www.supervised.news/p/putting-the-brakes-on-the-ai-hype">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[AI in August: RBAC is back, data as a product, and something about a bubble]]></title><description><![CDATA[The data engineers are more important than ever these days.]]></description><link>https://www.supervised.news/p/ai-in-august-rbac-is-back-data-as</link><guid isPermaLink="false">https://www.supervised.news/p/ai-in-august-rbac-is-back-data-as</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Thu, 05 Sep 2024 19:47:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NiD-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NiD-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NiD-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png 424w, https://substackcdn.com/image/fetch/$s_!NiD-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png 848w, https://substackcdn.com/image/fetch/$s_!NiD-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NiD-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NiD-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png" width="1456" height="1101" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1101,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6300862,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NiD-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png 424w, https://substackcdn.com/image/fetch/$s_!NiD-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png 848w, https://substackcdn.com/image/fetch/$s_!NiD-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NiD-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F581f3e5d-2070-42a0-8725-f69e4b5fa0be_2426x1834.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">a small friendly robot staring up at a large tree as the leaves start to change with autumn arriving, sunday funny comics aesthetic, --ar 4:3 &#8212; midjourney</figcaption></figure></div><p>The general consensus from sources and experts I talk to is that the universal truth of summer in tech and venture capital held this year: not a lot 
happened and everyone was on vacation.</p><p>That&#8217;s starting to pick up as we head into the fall, both with a lot of stock prices flying in opposite directions and earnings reports wrapping up. As teams start to trickle back into the office (home or otherwise), we&#8217;re starting to come up on the whole &#8220;AI in prod&#8221; mental deadline that showed up in a lot of companies I&#8217;ve spoken with over the last few months. The failed projects are going in the trash, and the stuff that&#8217;s useful is working its way into roadmaps.</p><p>That kind of &#8220;nearness&#8221; to production, as usual, is showing up in more friction and barriers to getting things out the door. In the past few months, that&#8217;s centered again on enterprise governance requirements and some other kinds of technical walls&#8212;and a kind of pragmatic shift in the works in how a lot of these companies think about their team structure. </p><p>And a lot of arguing about bubbles, which, as usual, is a much more complicated and nuanced situation than it seems at face value. So, with all that said, here&#8217;s what&#8217;s coming up as we head into the fall:</p><ul><li><p><strong>Role-based access control has entered the chat</strong>. Shortened to RBAC (and pronounced are-back), developers and industry executives have added the need to partition out who can get access to what data in a language model into the very large stack of <em>stuff</em> that&#8217;s needed to get done before something enters production. This is coming up mostly in the context of internal chatbots, but there&#8217;s more to it under the hood.</p></li><li><p><strong>The data engineers are now under the spotlight.</strong> AI (generative or otherwise) is quickly being recognized as a data-centric product. 
The data engineering teams, traditionally bound by the hip to analysts, now seem to be getting a lot more exposure at larger organizations.</p></li><li><p><strong>It has been (0) days since we have said AI is a bubble</strong>. After more than a year of talking about whether AI is in a bubble, we are again&#8230; talking about whether AI is in a bubble. We&#8217;ll go over this again, but the answer is that it&#8217;s complicated.</p></li></ul><h2>The RBAC wall to get into production</h2><p>One of the biggest potential problems most companies that I talk to are fretting about is whether some random employee will get access to sensitive data that they aren&#8217;t allowed to see. This isn&#8217;t restricted to PII or anything like that&#8212;it could even be in the context of someone who isn&#8217;t on an HR team getting access to salary information they aren&#8217;t supposed to be able to view.</p><p>This potential snafu has a lot of names within companies, though most people I talk to throw the term &#8220;leakage&#8221; on it. As a result, there&#8217;s some skittishness around whether to basically feed <em>all</em> of a company&#8217;s data into a custom model (fine-tuning or otherwise), slap some guardrails on it, and hope for the best.</p><p>Well, that&#8217;s not the first step companies usually take when they are looking at deploying something custom that taps company data. Instead, that company data is embedded into a database (such as a vector database, Postgres, or MongoDB) and a model can fetch it for a prompt through a process called retrieval-augmented generation, or RAG. There are varying levels of sophistication to it, with <a href="https://www.supervised.news/p/where-graph-databases-live-in-a-future">some companies also using it in conjunction with graph search</a>.</p><p>Each data point you&#8217;d want to retrieve is embedded in a &#8220;chunk,&#8221; or some block of information. 
The chunks can vary in size and format, ranging from just sentences in an email to full documents, and all that data has to be pre-processed in some form to even get to a point where a developer can embed it and make it available through RAG. But the list of governance requirements is continuously growing as companies get more serious about putting tools on top of language models into production, and determining who can access each chunk of information is now part of the set of requirements.</p><p>This is where role-based access control comes into play by assigning a kind of insulation layer around data points to determine who gets access to what. Some companies have already been <a href="https://www.elastic.co/search-labs/blog/rag-and-rbac-integration">trying to tackle the RBAC problem since earlier this year</a>, but the momentum for it seems to generally be picking up among companies and investors I talk to&#8212;likely as a byproduct of these companies trying to get these products based on language models out the door.</p><p>&#8220;We went from a world where we were focused on data loaders in early 2023, to a world where now you have to think about RBAC and bounding boxes so you have traceability and can show your homework,&#8221; Brian Raymond, CEO of unstructured data ETL startup Unstructured.io, told me. &#8220;Regardless of what&#8217;s in that manila envelope of data that gets written to a vector database, you have to have timestamps, owner, which group it belonged to for RBAC, version history, and a lot of other information. Any time you&#8217;re doing more than a proof-of-concept, it is a complete blocker.&#8221;</p><p>As with the example above, this mostly comes up when talking about companies building internal chatbots that replace sprawling Wikis that are increasingly inaccessible. 
But it&#8217;s pretty easy to see how it extends out to potential end users for even a customer-facing product&#8212;after all, an employee <em>is</em> a customer of an internal-facing product.</p><p>&#8220;We generate around 30 types of metadata during preprocessing, and that&#8217;s critical when you&#8217;re doing retrieval,&#8221; Raymond told me. &#8220;And we&#8217;re transposing the requirements data engineering teams have been developing over the last ten years onto the generative AI data stack.&#8221;</p><p>Again, guardrailing is <em>one</em> way to do it, but most developers and companies I talk to are increasingly recognizing that they&#8217;ll need something more sophisticated in place. And that leads to another emerging problem in AI, which is&#8230;</p><h2>Someone please check on the data engineers</h2><p>Each of these companies is also increasingly realizing that the problem of getting a generative AI product in production is&#8230; the same problem as getting a regular data-centric product into production: high-quality proprietary data.</p><p>Data engineers have traditionally built out pipelines that feed into more &#8220;classic&#8221; workflows like analytics. The emergence of dbt and the whole analytics engineering role started to extend the responsibilities of analysts to handle more data engineering-oriented tasks. But there are two emerging trends within all these companies that are actually trying to move past a proof-of-concept into something that can generate a real return:</p><ul><li><p>Many companies are tasking <em>software engineers</em> with building all of these AI-powered tools, and those engineers have to quickly grapple with the importance high-fidelity data plays in all of them.</p></li></ul>
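The transposition Raymond describes can be made concrete with a toy sketch of chunk-level RBAC: attach role metadata to each chunk at preprocessing time, then filter retrieval hits by the requesting user's roles before anything reaches the prompt. The field names and roles here are illustrative, not taken from any real product.

```python
# Each chunk carries role metadata assigned at preprocessing time, alongside
# the timestamps, owner, and version history mentioned above. The contents
# and the "allowed_roles" field name are made up for illustration.
chunks = [
    {"text": "All-hands notes: roadmap priorities for Q4", "allowed_roles": {"employee"}},
    {"text": "Salary bands for the engineering ladder", "allowed_roles": {"hr"}},
    {"text": "Customer churn postmortem, August", "allowed_roles": {"employee", "analyst"}},
]

def filter_by_role(user_roles: set, retrieved: list) -> list:
    # Drop any chunk the requesting user holds no role for *before* it can
    # be stuffed into a prompt: the "insulation layer" around data points.
    return [c for c in retrieved if c["allowed_roles"] & user_roles]

# A non-HR employee never sees the salary chunk,
# even if retrieval ranked it highly.
visible = filter_by_role({"employee"}, chunks)
```

The design point is that the access check happens at the retrieval layer, on metadata, rather than hoping guardrails on the model itself will catch a leak after the fact.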
      <p>
          <a href="https://www.supervised.news/p/ai-in-august-rbac-is-back-data-as">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The point of lightning-fast model inference]]></title><description><![CDATA[We're obsessed with generating thousands of tokens a second for a reason&#8212;and it isn't just to wow end users with text showing up on a screen really fast.]]></description><link>https://www.supervised.news/p/the-point-of-lightning-fast-model</link><guid isPermaLink="false">https://www.supervised.news/p/the-point-of-lightning-fast-model</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Tue, 27 Aug 2024 22:53:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6GkB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6GkB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6GkB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png 424w, https://substackcdn.com/image/fetch/$s_!6GkB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png 848w, https://substackcdn.com/image/fetch/$s_!6GkB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6GkB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6GkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png" width="1456" height="1067" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1067,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2809528,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6GkB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png 424w, https://substackcdn.com/image/fetch/$s_!6GkB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png 848w, https://substackcdn.com/image/fetch/$s_!6GkB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6GkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab62297-98df-44b7-9a9d-8571e337739c_1634x1198.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><strong>small robot furiously typing on a typewriter, the typewriter has flames coming out of it like it is spewing letters and words, sunday funny comics aesthetic &#8212; midjourney</strong></figcaption></figure></div><p>When ChatGPT first came out, it started out with&#8212;and popularized&#8212;a fun user experience to go with getting 
a message back from the early AI product: having the words print out in sequence as they are being generated.</p><p>These tools built on language models don&#8217;t <em>have</em> to do that. You could instead wait for the whole response to get generated, and then show the response. But for one reason or another, that experience largely stuck with standard interfaces for apps built on top of language models&#8212;even as the speed of token production has become so high that you probably can&#8217;t even discern it at this point.</p><p>But while there&#8217;s some threshold where the speed at which tokens are generated just kind of stops being visually noticeable or helpful, there&#8217;s another reason all this is happening under the hood. These responses generated at lightning speeds aren&#8217;t built just for humans&#8212;they&#8217;re built for the bots that they&#8217;ll be talking to in the future.</p><p>&#8220;I don&#8217;t think the interesting work is human-read in the future&#8212;the interesting work is machine-read,&#8221; Andrew Feldman, CEO and co-founder of Cerebras Systems, told me. &#8220;What you&#8217;ll see in the future is concatenations of models, where the output of one is the input to the next. That latency stacks. If you wanted to link 6 or 8 of these together, you wait a minute to get an answer. What we know is nobody waits a minute.&#8221;</p><p><a href="https://cerebras.ai/">Cerebras Systems</a>, a developer of custom AI chips (<a href="https://cerebras.ai/product-chip/">that are also </a><em><a href="https://cerebras.ai/product-chip/">colossal</a></em>), is one of the latest companies to step into this blistering speed race with its new product, <a href="https://cerebras.ai/inference">Cerebras Inference</a>. 
Feldman tells me that you&#8217;ll get speeds above 1,800 tokens per second on the smaller Llama 3.1 8B model, and 450 tokens per second on the larger Llama 3.1 70B model. And Cerebras&#8217; product runs inference on both models at full 16-bit precision, rather than the compressed 8-bit version that is often the default option&#8212;particularly for the larger models.</p><p>This all becomes relevant in the context of the current semi-fever dream of agents&#8212;a collection of models that can operate more autonomously and solve complex tasks by passing them amongst one another until they return a result. Each individual model requires a single call, and a decision on where to route that result, which can hypothetically balloon with the complexity of tasks. And all this doesn&#8217;t even include the possibility of failures in a chain of operations, such as a prompt rejection (due, say, to it returning personally identifiable information).</p><p>Cerebras Systems is one of a number of companies looking to exploit architecture alternatives to Nvidia&#8217;s hardware in order to satisfy a very broad set of use cases. And it is trying to demo the lightning speed of all of these operations, much like Groq and SambaNova Systems. Inference <a href="https://www.together.ai/blog/together-ai-partners-with-meta-to-release-meta-llama-3-for-inference-and-fine-tuning">platforms like Together AI</a> and Fireworks also try to push the envelope on tokens per second.</p><p>And while it all moves faster than a human eye can read, it turns out that this blistering speed is just one of what are likely many precursors to the ability to build out networks of models that can satisfy some of the potential dream scenarios for language models.</p><p>&#8220;This new processing power is a game-changer,&#8221; Jonathan Corbin, co-founder and CEO of <a href="https://www.mavenagi.com/">Maven AGI</a>, told me. 
&#8220;It allows AI agents to handle vast datasets in real-time, make more nuanced decisions, and adapt quickly to new information. We view this as crucial for developing AI agents with human-like understanding and responsiveness.&#8221;</p><h2>The case for agents and speed</h2><p>Right now we&#8217;re a considerable ways off from creating some kind of one-size-fits-all &#8220;agent&#8221; that can perform a task of more arbitrary complexity. That could either be through the use of some model with insane reasoning capabilities or a long, long chain of models strung together. </p><p>But there is already a lot of low-hanging fruit that current, off-the-shelf hardware has the potential to resolve. More specifically, these language models&#8212;particularly the smallest Llama 3.1 8B model&#8212;are able to solve compact tasks that involve processes like classification, summarization, or entity extraction. Each of these excels in customer service use cases, particularly when routing problems and determining whether to rope in a human to resolve them.</p><p>The complete base case of all of this is having a &#8220;robot&#8221; complete each of those tasks in isolation, and then moving it onto the next robot to complete the next incremental step. The classic term for this is, aptly, robotic process automation (or RPA). Except instead of having a bot click around on a website to test something, this process is taking in information and determining what to do with it at a much higher level. 
These are the same types of tasks companies were using older language models, like BERT, for&#8212;they&#8217;re just substantially more advanced and can handle more complex problems.</p><p>Klarna, the company people reflexively point to when it comes to the potential of dropping AI into customer service, <a href="https://downloads.ctfassets.net/4pxjo1vaz7xk/161qRVqB2B2N8MYwLl9x5A/abcc53df5acfac3fa4bb36d4621db99c/Klarna_Holding_AB_Interim_Report_2024__ENG_.pdf">reiterated in an earnings report</a> that the use of AI was able to perform the work of more than 700 employees, and reduce the average resolution time for a customer service case from 11 minutes to just 2 minutes. It might be boring, but <a href="https://www.supervised.news/p/batch-processing-and-the-rise-of">batch processing is largely where there&#8217;s a lot of value to extract</a>&#8212;and waiting a fifth of the time to resolve some customer service issue seems pretty helpful!</p>
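Feldman's "latency stacks" point is simple arithmetic, and a rough sketch makes it clear why chained models push providers toward extreme tokens-per-second numbers. The speeds and token counts below are illustrative assumptions, not measured figures from any of these vendors.

```python
# Back-of-the-envelope latency for a chain of model calls.
# All numbers are illustrative assumptions, not benchmarks.

def chain_latency(tokens_per_step: int, tokens_per_second: float, steps: int) -> float:
    # Each model in the chain must finish generating its output before the
    # next one can start, so per-step latency adds up rather than overlapping.
    return steps * (tokens_per_step / tokens_per_second)

# A hypothetical 8-step agent chain, 500 output tokens per step:
slow = chain_latency(500, 60, 8)     # about 66.7 seconds at 60 tokens/sec
fast = chain_latency(500, 1800, 8)   # about 2.2 seconds at 1,800 tokens/sec
```

At the slower speed you get exactly the "nobody waits a minute" problem; at Cerebras-class speeds the same chain finishes in a couple of seconds, which is what makes longer chains plausible at all.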
      <p>
          <a href="https://www.supervised.news/p/the-point-of-lightning-fast-model">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The free tokens will continue until business improves]]></title><description><![CDATA[The SaaS-ification of AI continues, with OpenAI and Google ready to go full 2010s to get people on board. Plus, durable execution returns as a topic du jour.]]></description><link>https://www.supervised.news/p/the-free-tokens-will-continue-until</link><guid isPermaLink="false">https://www.supervised.news/p/the-free-tokens-will-continue-until</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Fri, 23 Aug 2024 21:47:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mWpL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mWpL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mWpL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png 424w, https://substackcdn.com/image/fetch/$s_!mWpL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png 848w, https://substackcdn.com/image/fetch/$s_!mWpL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png 1272w, 
https://substackcdn.com/image/fetch/$s_!mWpL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mWpL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png" width="1456" height="1183" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1183,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2177152,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mWpL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png 424w, https://substackcdn.com/image/fetch/$s_!mWpL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png 848w, https://substackcdn.com/image/fetch/$s_!mWpL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png 1272w, 
https://substackcdn.com/image/fetch/$s_!mWpL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d22f37a-2f49-4b6e-a399-c0c5bec299fe_1592x1294.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">a small friendly robot standing on a massive pile of arcade tokens, the pile of tokens is miles high, sunday funny comics aesthetic &#8212; midjourney</figcaption></figure></div><p><em>Two topics today: first is on all the free stuff we&#8217;re getting to convince me to fine-tune a model; second, why everyone is back to talking about 
durable execution. Once again, thank you everyone for your patience as I get ramped back up!</em></p><div><hr></div><h2>The free stuff playbook is back in AI</h2><p>When Sam Altman was<a href="https://www.supervised.news/p/with-sam-altman-out-where-does-openai"> briefly removed from OpenAI in November last year</a>, it represented a weird moment in AI where most companies <em>weren&#8217;t</em> operating with a very Silicon Valley growth-at-all-costs, SaaS-ify-everything mentality. Sure, there were a lot of mega-rounds, but one of OpenAI&#8217;s early advantages was that it was able to<a href="https://www.supervised.news/p/openais-silicon-valley-pivot"> undercut the competition with an easy-to-use API</a>.</p><p>Since then, the SaaS-ification of AI seems to have quickly become the norm. Providers are constantly racing to see how quickly they can push their prices as close to zero as possible to one-up each other. This is true with the more capable foundation models, but it&#8217;s also <em>particularly</em> true for the workhorse models from those model providers&#8212;the ones that benefit the most from customization on proprietary data.</p><p>And while the basic APIs are easy to swap among one another&#8212;whether that&#8217;s Google, Anthropic, OpenAI, or an inference platform like Together AI, which you can drop in with a few lines of code&#8212;the &#8220;lock-in&#8221; now potentially comes in an alternate form: fine-tuning APIs using proprietary data. </p><p>With fine-tuned models, which are essentially modified to perform well on specific use cases for your business, like summarizing documents, the switching cost to a new service isn&#8217;t zero. Rather, the switching cost is the time, headache, and tokens necessary to re-do a model customization job on a different service. 
And, just like in the on-demand and Web 2.0 boom, the two largest providers are trying to snag you with a bunch of free stuff.</p><p>OpenAI this week said <a href="https://openai.com/index/gpt-4o-fine-tuning/">it was making fine-tuning available for GPT-4o</a>. And<a href="https://platform.openai.com/docs/guides/fine-tuning"> while fine-tuning for GPT-4o mini has been available since late July for its higher-tier organizations</a>, it&#8217;s now available for all paid usage tiers, <a href="https://openai.com/index/gpt-4o-fine-tuning/">per the blog post the company put out</a>. OpenAI had already made 2 million training tokens available per day for GPT-4o mini, and this week it made 1 million training tokens available per day for GPT-4o. Those free tokens are available through the next month, but we also know exactly how the price wars among Uber, DoorDash, Lyft, and others went, with what seemed like endless subsidization. (Sure, that was ZIRP-era, but this time around we have GPU credits too.)</p><p>This is just a little less than two weeks after Google<a href="https://developers.googleblog.com/en/gemini-15-flash-updates-google-ai-studio-gemini-api/"> announced it was making its Gemini Flash 1.5 tuning product available for all developers</a>. Gemini Flash 1.5, in addition to being <em>very cheap</em>, also comes with 1.5 <em>billion</em> free tokens per day for developers. Per Google&#8217;s pricing page, <a href="https://ai.google.dev/pricing">the tuning price for Gemini Flash is free of charge</a>. Google even offers a free tier for its Gemini Pro 1.5, though at a considerably lower volume of 1.6 million tokens per day and a much more limited rate (and per its pricing page, tuning is unavailable). 
But the point is that, for now, Google (and OpenAI) can afford all this in the hopes of hosting companies&#8217; portfolios of custom models.</p><p>And while these pay-as-you-go models aren&#8217;t exactly what massive enterprises would adopt&#8212;that would more likely be provisioned usage and contracts with much more aggressive service level agreements&#8212;they serve as an easy and very aggressive on-ramp to getting things into production, and a very enticing carrot for further usage.</p><p>It feels like the industry is quickly coalescing around a handful of companies with the resources to produce these kinds of powerful fine-tuned workhorse models&#8212;and they&#8217;re all trying to one-up each other in how much product they can put in front of companies to try to lock them into an ecosystem. We saw all this before, multiple times, like in the on-demand era, with companies giving away as much free stuff as they possibly could in hopes of converting it into a sustainable business at <em>some point</em>.</p><p>And these more workhorse-focused fine-tuned models are essentially one of the end-games for enterprise-focused tasks. Achieving some benefit from fine-tuning for tasks like classification or summarization, in the case of Gemini, <a href="https://ai.google.dev/gemini-api/docs/model-tuning">generally requires a number of examples in the low 100s</a>.</p><p>But once you&#8217;ve built those fine-tuned models, you can&#8217;t move them&#8212;OpenAI has them, and if you want to move, you&#8217;ll have to re-do the process all over again either through Gemini or some alternate service. 
This is one of the main selling points of fine-tuning open source models: you can move fine-tuned versions of, say, Llama 3.1 8B around as needed because they port to any infrastructure that can run that kind of model.</p><p>And indeed, that&#8217;s also <a href="https://www.supervised.news/p/lakes-catalogs-and-snowflakes-full">part of the threat that Databricks and Snowflake pose to OpenAI (along with Google and company)</a>. If fine-tuning with proprietary data is the way to unlock value&#8212;like building a custom summarization tool for my sales calls&#8212;the button closest to the data with the most optionality is going to be the most valuable. OpenAI faces the uphill battle of courting enterprises with its API when those enterprises likely already have accounts with all of these abstraction layers and hyperscalers in the first place.</p><p>While we haven&#8217;t seen what it looks like just yet, the next obvious player here will be whatever rabbit Anthropic pulls out of its hat with its next-generation workhorse model. Right now you <a href="https://www.anthropic.com/news/fine-tune-claude-3-haiku">can fine-tune Claude Haiku through Amazon Bedrock</a>, but it is nowhere near as easy as the pay-as-you-go API approach that both Google and OpenAI offer. But Anthropic also has the benefit of Amazon promoting it to AWS customers, where the data for fine-tuning already sits for many companies.</p><p>It also doesn&#8217;t help OpenAI that Cursor, a more flexible AI-enabled IDE where you plug in <em>any</em> provider&#8217;s API key, has a lot of hype among the developers and investors I talk to lately. <a href="https://www.supervised.news/p/a-new-whisper-in-the-ai-analytics">I wrote several months ago about how one of Cursor&#8217;s biggest advantages was that you could pop in any code generation API key</a>, making OpenAI even more disposable if a new superior model comes out&#8212;either proprietary or open source. 
(Cursor&#8217;s <a href="https://www.cursor.com/blog/series-a">parent company Anysphere announced it raised $60 million this week</a>.)</p><p>This constant deluge of free stuff shouldn&#8217;t exactly surprise anyone, because all these companies are staffed with the same type of executive that lived through the on-demand era. For OpenAI in particular, both Sam Altman and <a href="https://www.ycombinator.com/blog/welcome-x11">chief operating officer Brad Lightcap</a> hail from Y Combinator, the storied accelerator that spawned both Instacart and Dropbox&#8212;companies that, among others, popularized a &#8220;growth first, business later&#8221; mindset. Meanwhile, though Google has seemed to put together a rather comprehensive stack for AI, it still feels a lot like the same old Google that has been shoving free ad credits into the face of anyone who decides to open a G Suite account.</p><p>The on-demand era, and really Web 2.0 broadly, was often affectionately dubbed &#8220;venture-funded capitalism,&#8221; and came to a halt when all that free capital dried up. The duel in on-demand largely revolved around a handful of moat-building exercises: brand affinity, collecting data, wooing drivers (who ended up driving for both anyway), and altering consumer behavior away from hailing a cab or calling for delivery. For these companies, the last one was really the only one that was truly successful, as they all struggled to build out long-term sustainable businesses.&nbsp;</p>
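<p><em>To make that switching cost concrete, here&#8217;s a minimal sketch of what assembling a fine-tuning dataset looks like. The JSONL chat format matches what OpenAI&#8217;s fine-tuning guide documents; the example tickets, labels, and the commented-out API calls are illustrative assumptions, not a tested integration.</em></p>

```python
import json

# A few labeled examples -- real fine-tuning jobs for tasks like
# classification or summarization generally need on the order of 100+.
examples = [
    ("The quarterly report shows revenue grew 12% year over year.", "finance"),
    ("Our checkout page throws a 500 error when applying a coupon.", "bug report"),
    ("Can you upgrade my plan to the enterprise tier?", "billing"),
]

def to_openai_jsonl(rows, path):
    """Write examples in the chat-format JSONL that OpenAI's fine-tuning
    endpoint expects: one {"messages": [...]} object per line."""
    with open(path, "w") as f:
        for text, label in rows:
            record = {
                "messages": [
                    {"role": "system", "content": "Classify the support ticket."},
                    {"role": "user", "content": text},
                    {"role": "assistant", "content": label},
                ]
            }
            f.write(json.dumps(record) + "\n")

to_openai_jsonl(examples, "train.jsonl")

# Kicking off the job itself is provider-specific, roughly:
#   file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
#   client.fine_tuning.jobs.create(training_file=file.id, model="gpt-4o-mini-2024-07-18")
# Moving to Gemini or anywhere else means reformatting the same examples and
# paying for a whole new tuning run -- that is the switching cost.
```

<p><em>The trained weights stay with the provider; only the raw examples are portable.</em></p>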
      <p>
          <a href="https://www.supervised.news/p/the-free-tokens-will-continue-until">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[About that AI bubble]]></title><description><![CDATA[AI can be far from achieving its potential, but it can also be really useful right now.]]></description><link>https://www.supervised.news/p/about-that-ai-bubble</link><guid isPermaLink="false">https://www.supervised.news/p/about-that-ai-bubble</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Fri, 16 Aug 2024 19:05:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9S_d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9S_d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9S_d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png 424w, https://substackcdn.com/image/fetch/$s_!9S_d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png 848w, https://substackcdn.com/image/fetch/$s_!9S_d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9S_d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9S_d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png" width="1456" height="1097" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1097,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4715339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9S_d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png 424w, https://substackcdn.com/image/fetch/$s_!9S_d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png 848w, https://substackcdn.com/image/fetch/$s_!9S_d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9S_d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f9273-21cc-4531-8bb0-bcb297d483e9_2444x1842.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">a large bubble rising above a large cyberpunk type city, blade runner 2049 aesthetic, deep blue hue, --ar 4:3 &#8212; </figcaption></figure></div><p><em>I had to finish this one on my phone, so apologies ahead of time for formatting or spelling errors!</em></p><div><hr></div><h2>The duality of the AI bubble</h2><p>Much has been said in 
the last several months about whether or not the bubble for investment in AI is going to pop&#8212;particularly following the acquihires-slash-<a href="https://www.reuters.com/technology/artificial-intelligence/google-hires-characterai-cofounders-licenses-its-models-information-reports-2024-08-02/">acquilicenses</a>-slash-whatever of previously hyped AI startups <a href="https://www.theverge.com/2024/8/2/24212348/google-hires-character-ai-noam-shazeer">Character.AI</a>, <a href="https://www.bloomberg.com/news/articles/2024-03-21/microsoft-to-pay-inflection-ai-650-million-after-scooping-up-most-of-staff?embedded-checkout=true">Inflection</a>, and <a href="https://www.theverge.com/2024/7/1/24190060/amazon-adept-ai-acquisition-playbook-microsoft-inflection">Adept</a>. These are companies that were going after whole generative AI experiences that required massive GPU clusters&#8212;the kinds of products that have sent Nvidia rocketing toward a valuation that at <a href="https://www.reuters.com/technology/nvidia-sets-eye-1-trillion-market-value-2023-05-30/">one point passed $1 trillion.</a></p><p>The list seems to continue to grow every week. One of the most-hyped AI-first products, the Humane Ai Pin, <a href="https://www.theverge.com/2024/8/7/24211339/humane-ai-pin-more-daily-returns-than-sales">has seemingly gone&#8230; not well</a>. The whole industry in and around AI rode a colossal hype wave to the point that companies even adjacent to &#8220;core AI&#8221; were picking up valuations upwards of hundreds of times annual recurring revenue. </p><p>Now, the big question is, where&#8217;s the revenue? Where is the business? Where are the killer apps? Is all this just a complete waste of time with a very limited potential industry that requires colossal upfront investments? 
Is OpenAI just going to go out of business?</p><p>Well, two things can be true at the same time: the foundation model providers building out various colossi and dreaming of artificial general intelligence aren&#8217;t living up to the hype right now; and AI is actually really, really useful and <a href="https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/">already realizing returns within some organizations&#8212;just perhaps not in particularly exciting ways</a>.</p><p>Instead, the answer to whether or not we&#8217;re in an AI bubble is probably more disappointing and frustrating: it&#8217;s complicated.</p><h2>The less difficult, and more valuable, use cases for AI right now</h2><p>I&#8217;ve written about those types of use cases before, but let&#8217;s talk about them again: batch processing and robotic process automation (or RPA). These use cases are really straightforward, the kinds of things we were doing well before ChatGPT and its neighbors arrived. Summarization, entity extraction, classification, and sentiment analysis are just a few examples&#8212;but the point is that you don&#8217;t need some massively powerful foundation model that costs $10 per million tokens to do it.</p><p>In most of these use cases companies can get away with using models like a customized version of GPT-3.5 Turbo, with a system in place to retrieve relevant information from a specialized database through a process called retrieval-augmented generation, or RAG for short. Most of the foundation model providers have these kinds of &#8220;workhorse&#8221; models for a reason&#8212;they accomplish the vast majority of needed tasks and perform at a relatively high level at a relatively low cost.</p><p>This extends to what we&#8217;d consider &#8220;agents&#8221; as well. 
The kind of one-size-fits-all-self-reasoning-autonomous-replace-us-all digital entity seems not only far away, but in most of these very achievable cases completely unnecessary. There are some potential gains to be had through an increasing ability to self-orchestrate, but the near term also points to networks of agents that are again customized to complete a fixed set of tasks before passing them off to the next &#8220;agent.&#8221; </p><p>When we talk about &#8220;AI in prod,&#8221; which is an umbrella so comically large you could fit a small city under it, the two get bunched together. AGI is pitched as the way to accomplish all these tasks that can&#8230; already be accomplished by what&#8217;s currently on the shelf. Developers I talk to say customer service ticket routing and escalation is relatively straightforward with these models&#8212;a very time-intensive process that otherwise requires a lot of people doing a lot of mundane work. And, to be clear, it doesn&#8217;t even replace these representatives. (One source even found some of the more complicated approaches to fine-tuning and deploying small models like Llama 3 overkill for tasks that could easily be handled by a fine-tuned GPT-3.5 Turbo.)</p>
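<p><em>The RAG pattern described above fits in a few lines. This toy sketch uses word-overlap similarity in place of real embeddings and a vector database, and the documents and question are made up for illustration.</em></p>

```python
import math
from collections import Counter

# Stand-in for the "specialized database" of proprietary documents.
documents = [
    "Refunds are processed within 5 business days of approval.",
    "Enterprise plans include 24/7 phone support and a dedicated rep.",
    "API rate limits reset every 60 seconds on the free tier.",
]

def vectorize(text):
    # Toy embedding: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_prompt(question, k=1):
    """Retrieve the k most relevant documents and stuff them into the
    prompt, so a cheap "workhorse" model answers from your own data."""
    qv = vectorize(question)
    ranked = sorted(documents, key=lambda d: cosine(qv, vectorize(d)), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("How fast are refunds processed?")
# The prompt now carries the refund-policy document as grounding context
# and would be sent to the model of your choice.
```

<p><em>Swapping the model behind this prompt is trivial; it&#8217;s the retrieval layer, sitting next to your data, that sticks around.</em></p>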
      <p>
          <a href="https://www.supervised.news/p/about-that-ai-bubble">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Revisiting that old Google AI memo]]></title><description><![CDATA[A few things have changed since a Google researcher sounded the alarm on Google's risk to open source AI in a leaked memo last year.]]></description><link>https://www.supervised.news/p/revisiting-that-old-google-ai-memo</link><guid isPermaLink="false">https://www.supervised.news/p/revisiting-that-old-google-ai-memo</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Fri, 09 Aug 2024 19:13:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!REoK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!REoK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!REoK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png 424w, https://substackcdn.com/image/fetch/$s_!REoK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png 848w, https://substackcdn.com/image/fetch/$s_!REoK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png 1272w, 
https://substackcdn.com/image/fetch/$s_!REoK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!REoK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png" width="1456" height="1099" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1099,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6272896,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!REoK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png 424w, https://substackcdn.com/image/fetch/$s_!REoK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png 848w, https://substackcdn.com/image/fetch/$s_!REoK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png 1272w, 
https://substackcdn.com/image/fetch/$s_!REoK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87b18e7a-476d-4460-8da9-48a2444635e1_2446x1846.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>Hope everyone had a productive few weeks! I&#8217;m back and resuming publishing, though slowly ramping back up as I get back in touch with people, so there may be fewer issues for the next two weeks or so. 
Thanks everyone for your patience!</em></p><div><hr></div><p>A bit more than a year ago, at the dawn of the comically short period that we&#8217;ll call &#8220;modern AI,&#8221; a memo from a Google researcher <a href="https://www.semianalysis.com/p/google-we-have-no-moat-and-neither">found its way onto SemiAnalysis</a> that essentially served as a warning of the risks open source AI presented to foundation model providers like OpenAI and Google. (<a href="https://www.theverge.com/23778745/demis-hassabis-google-deepmind-ai-alphafold-risks">DeepMind CEO Demis Hassabis also confirmed its authenticity to The Verge</a>.)</p><p>The <a href="https://www.supervised.news/p/a-model-in-your-pocket">original Llama model from Meta had recently leaked</a>, and it became pretty clear that the developer community was going to run off with it and develop a lot of new techniques to improve the performance of open source models, <a href="https://www.supervised.news/p/for-apple-its-rpa-all-the-way-down">many of which were actually adopted by Apple for its forthcoming AI suite Apple Intelligence</a>. But at the time the catchphrase was that Google &#8220;had no moat, and neither does OpenAI.&#8221;</p><p>A few things have changed since then! And it seems like as good a time as any to look at where Google actually sits in this whole mess, as its latest version of Gemini Pro, its own foundation model, and its latest micro-model Gemma 2 2B both <a href="https://chat.lmsys.org/?leaderboard">sit atop yet another leaderboard</a> <a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard">respectively</a> for models in their class.</p><p>Google has since run a kind of multi-track approach to AI: its own Gemini-series foundation models, with the expectation you will run up a million tokens per prompt, and a platform of smaller models (including its own Gemma-series models) available through Google Vertex. 
But while it&#8217;s been pretty easy to dunk on Google for not having a real AI strategy, as its AI search products seem to have not gone so well, there are some more subtle signals that Google might not be floundering as much as it seems on the surface.</p><p>While OpenAI invests heavily in creating a suite of foundational technology across a wide variety of modalities, it&#8217;s relying on powerful models that require powerful hardware, and it doesn&#8217;t work with the open source community. Meta, meanwhile, has invested heavily in training a wide variety of open-ish source models. Meta can deploy those open-ish source models in its products, incorporating the best learnings from other users, but it carries the potential risk of everyone having access to the same base technology it has. And in both situations, they&#8217;re still largely reliant on hardware provided by Nvidia.</p><p>Google has found a way to do both, and has the potential to do so without heavy reliance on external hardware providers. Having a powerful foundational model product effectively gives it a SaaS business to monetize all that work around training and development for current and potential future use cases. (A Google spokesperson noted that while they did use TPUs for internal workloads, they also run internal workloads on GPUs.)</p><p>And ingratiating itself in the developer community gives it a direct line to the rapid experimentation happening there, as well as an opportunity to nudge the development arc of AI in directions that could potentially benefit Google. It&#8217;s much the same way Meta was able to benefit from the deep learning community largely adopting PyTorch, taking some of the best learnings and applying them to Meta products. 
</p><p>Google has quietly assembled a kind of comprehensive stack that&#8212;<a href="https://cloud.google.com/nvidia?e=48754805&amp;hl=en">while it obviously has invested in making Nvidia hardware available</a>&#8212;could give it a level of autonomy that many of the other AI developers and providers don&#8217;t necessarily have. At a time when everyone is reliant on Nvidia, the years of development on this immense stack on Google Cloud are paying off, even if it has yet to convert all this into a compelling consumer product. In its most recent operating quarter, Google <a href="https://abc.xyz/assets/19/e4/3dc1d4d6439c81206370167db1bd/2024q2-alphabet-earnings-release.pdf">said its Cloud business for the first time passed $10 billion in quarterly revenue and $1 billion in operating profit</a>. </p><p>That certainly includes hardware with its TPUs, but Google also owns a software stack that has been quietly picked up by others, with both xAI and Apple disclosing they used JAX for their <a href="https://x.ai/blog/grok-1.5">development of Grok</a> and <a href="https://machinelearning.apple.com/research/introducing-apple-foundation-models">Apple Intelligence models respectively</a>.</p><p>In that memo, the Google researcher flagged a handful of signals that could threaten Google&#8217;s (and OpenAI&#8217;s) dominance in AI from the open source community, and by extension companies that would use open source technology. But Google has essentially covered the majority of them while also maintaining development and release of a larger, more powerful product that competes more directly with Anthropic and OpenAI. </p><p>When ChatGPT launched in November 2022, Google was essentially caught flat-footed and had to scramble to figure <em>something</em> out. 
We ended up getting a series of poorly received products like its <a href="https://www.nytimes.com/2024/05/24/technology/google-ai-overview-search.html">attempts at an AI-powered search engine</a>. But setting aside that consumer strategy, Google has clearly created a kind of top-to-bottom developer experience that it can use either internally or serve externally.</p><p>Google was quietly<a href="https://www.businessinsider.com/facebook-pytorch-beat-google-tensorflow-jax-meta-ai-2022-6"> throwing its resources behind JAX for internal use cases several years ago</a>, even before language models became a big focus in modern AI. And while PyTorch still remains a preferred deep learning framework (part of that thanks to Google&#8217;s own history with TensorFlow), JAX has seemingly started to make its way into larger organizations. (And let&#8217;s not forget <a href="https://research.google/blog/open-sourcing-bert-state-of-the-art-pre-training-for-natural-language-processing/">Google created practically the original major open source language model with BERT</a>.)</p><p>And while search continues to be Google&#8217;s primary cash engine&#8212;which (in extreme emphasis) <em>could</em> end up disrupted by AI&#8212;its products are all intricately linked to its own developer stack. That was true of its research breakthroughs with MapReduce and TensorFlow, and it seems like it will continue with its development of AI-based language models.</p><h2>What was called correctly in that memo, and what Google has done</h2><p>At the time the original memo leaked, Google had <a href="https://blog.google/technology/ai/bard-google-ai-search-updates/">launched Bard just a few months earlier</a>, and would launch its PaLM 2 model shortly after that while also teasing the existence of Gemini. The former was, well, not great, and the latter demonstrated an emphasis on larger models&#8212;and chasing OpenAI. 
(And, to be fair, <a href="https://blogs.microsoft.com/blog/2023/02/07/reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/">Bing&#8217;s search was also widely panned</a>.)</p><p>The whole sequence of events was comical enough that people were essentially wondering if Google was facing an existential threat as the hype around AI absolutely exploded. That hype led to colossal funding rounds in a lot of emerging companies&#8212;some of which <a href="https://www.reuters.com/technology/microsoft-agreed-pay-inflection-650-mln-while-hiring-its-staff-information-2024-03-21/">have been effectively acquihired</a> (<a href="https://techcrunch.com/2024/08/02/character-ai-ceo-noam-shazeer-returns-to-google/">and acquilicensed, I guess</a>). And Google&#8217;s emphasis on larger models was one of a number of potential problems the memo flagged, as Meta&#8217;s original and much smaller Llama model captivated the developer ecosystem.</p><p>In broad strokes, this is what the researcher covered back when the memo leaked in May last year&#8212;much of which turned out to be directionally where the industry would head in the lead-up to Apple&#8217;s detailing of its own on-device AI suite, Apple Intelligence:</p><ul><li><p><strong>On-device model inference</strong>: While on-device turned out to be an extremely popular outcome, this can essentially be abstracted out to low-power inference&#8212;like on a CPU&#8212;which is predicated on quantized versions of smaller models. 
Google ended up moving into this with Gemini Nano.</p></li><li><p><strong>Fine-tuning for personalization</strong>: Low-rank adaptation, at the time one of a handful of experimental techniques the open source community was running off with, turned out to be a preferred approach for customization&#8212;<a href="https://machinelearning.apple.com/research/introducing-apple-foundation-models">including being deployed by Apple</a>.</p></li><li><p><strong>The quality gap for large foundation models was closing</strong>: If the leaderboardification of AI has taught us anything, it&#8217;s that all these larger models developed by Google, Anthropic, OpenAI, and others are <em>very</em> close to each other in quality, and the one-upmanship doesn&#8217;t look like step-function improvements.</p></li><li><p><strong>Focusing on larger foundation models was slowing Google down</strong>: At the time, the hype was around building a ChatGPT competitor that could do a whole lot of stuff all at once. But Google essentially showed it could do a bit of everything and do it pretty well, though it&#8217;s not clear exactly what the payoff on the other end will be for its Gemma series of models.</p></li><li><p><strong>People might not pay for a restricted model behind an API if free open source models are available</strong>: We&#8217;ll get to this one in a second, but it did not end up being entirely correct&#8212;though that could still change in the future. We did, indeed, see a rapid race to the bottom on price.</p></li></ul><p>At the time, companies were hoovering up as much Nvidia hardware as they possibly could, which led to an overall shortage of GPUs. Meanwhile, projects like Llama.cpp (and eventually Ollama) started to demonstrate that smaller versions of models, produced through a process called quantization, were able to run on edge devices like laptops.</p>
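<p>The core idea behind quantization can be sketched in a few lines: map 32-bit floating point weights to low-precision integers plus a scale factor, trading a small rounding error for a big memory win. (This is a toy illustration of symmetric 8-bit quantization; real projects like Llama.cpp use more elaborate block-wise schemes, so the function names and shapes here are purely illustrative.)</p>

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: weights ~ q * scale, with q stored as int8."""
    scale = float(np.abs(weights).max()) / 127.0  # largest weight maps to +/-127
    q = np.round(weights / scale).clip(-127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * scale

# A stand-in "layer" of float32 weights
rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)

q, scale = quantize_int8(w)
memory_ratio = w.nbytes / q.nbytes                   # int8 uses 4x less memory
max_error = np.abs(dequantize(q, scale) - w).max()   # rounding error is at most scale / 2
```

<p>Shrinking each weight from four bytes to one (or even less, with 4-bit variants) is what lets models that would otherwise demand datacenter GPUs fit into a laptop&#8217;s memory, at the cost of a small, bounded error per weight.</p>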
      <p>
          <a href="https://www.supervised.news/p/revisiting-that-old-google-ai-memo">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The search for the next class of AI founders]]></title><description><![CDATA[Venture firms are trying to get creative to get in front of the next hottest AI startups. But they have to look in new areas to find them.]]></description><link>https://www.supervised.news/p/the-search-for-the-next-class-of</link><guid isPermaLink="false">https://www.supervised.news/p/the-search-for-the-next-class-of</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Wed, 26 Jun 2024 22:47:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!40zN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!40zN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!40zN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png 424w, https://substackcdn.com/image/fetch/$s_!40zN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png 848w, https://substackcdn.com/image/fetch/$s_!40zN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png 1272w, 
https://substackcdn.com/image/fetch/$s_!40zN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!40zN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png" width="1456" height="1095" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1095,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5816000,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!40zN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png 424w, https://substackcdn.com/image/fetch/$s_!40zN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png 848w, https://substackcdn.com/image/fetch/$s_!40zN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png 1272w, 
https://substackcdn.com/image/fetch/$s_!40zN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b193cf7-86e9-4b0d-b730-3631ef01b0f9_2374x1786.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">a happy scientist in a lab coat leaning onto a desk tinkering with a robot, the desk is very messy, the scientist's face is close to the robot as he turns a screwdriver on it, there is a school diploma and a graduation cap on the desk as well as a stack of papers that say "OFFER" on it, the robot is small and looks friendly, 45 degree 
angle camera sunda&#8230;</figcaption></figure></div>
      <p>
          <a href="https://www.supervised.news/p/the-search-for-the-next-class-of">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[For Apple, it's RPA all the way down]]></title><description><![CDATA[Plus: the future of an AI data marketplace continues to potentially lie within Snowflake and Databricks&#8212;if they can convince customers of the value.]]></description><link>https://www.supervised.news/p/for-apple-its-rpa-all-the-way-down</link><guid isPermaLink="false">https://www.supervised.news/p/for-apple-its-rpa-all-the-way-down</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Fri, 14 Jun 2024 22:29:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4Rbc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Rbc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Rbc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png 424w, https://substackcdn.com/image/fetch/$s_!4Rbc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png 848w, https://substackcdn.com/image/fetch/$s_!4Rbc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4Rbc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Rbc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png" width="1456" height="1096" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1096,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3205387,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Rbc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png 424w, https://substackcdn.com/image/fetch/$s_!4Rbc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png 848w, https://substackcdn.com/image/fetch/$s_!4Rbc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4Rbc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf4215e7-4e42-49d4-95f5-d3fc35bae6d3_1528x1150.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">a small and very advanced, smiling robot sittin in a large sandbox in a playground, there are lots of spherical objects in the sand, several emojis are also lying in the sandbox, sunday funny comics aesthetic, --ar 4:3 &#8212; midjourney</figcaption></figure></div><p><em>Good morning and happy Friday everyone! 
Today we&#8217;re covering a handful of topics including Apple&#8217;s relatively unremarkable A&#8230;</em></p>
      <p>
          <a href="https://www.supervised.news/p/for-apple-its-rpa-all-the-way-down">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Lakes, catalogs, and Snowflake's full court press on AI]]></title><description><![CDATA[Snowflake is converging with its rival, Databricks, faster than ever before. And its new CEO is gearing up for a partial "pivot" into AI.]]></description><link>https://www.supervised.news/p/lakes-catalogs-and-snowflakes-full</link><guid isPermaLink="false">https://www.supervised.news/p/lakes-catalogs-and-snowflakes-full</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Tue, 04 Jun 2024 20:49:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8xZx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8xZx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8xZx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png 424w, https://substackcdn.com/image/fetch/$s_!8xZx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png 848w, https://substackcdn.com/image/fetch/$s_!8xZx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png 1272w, 
https://substackcdn.com/image/fetch/$s_!8xZx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8xZx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png" width="1456" height="1085" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1085,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4227307,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8xZx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png 424w, https://substackcdn.com/image/fetch/$s_!8xZx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png 848w, https://substackcdn.com/image/fetch/$s_!8xZx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png 1272w, 
https://substackcdn.com/image/fetch/$s_!8xZx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F764b3c84-8296-418b-b60b-16f68bdddd34_1844x1374.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">two sleek, friendly and smiling robots playing chess with each other, the setting is union square in new york city in a crowded environment, sunday funny comics aesthetic, --ar 4:3 &#8212; midjourney</figcaption></figure></div><p>For the majority of its life, Snowflake was synonymous with analytics through its tools as a data 
warehouse. It had made some exploratory moves into machine learning as it saw a new rival creeping up, including the <a href="https://www.snowflake.com/blog/snowflake-to-acquire-streamlit/">abrupt $800 million acquisition of the Python developer platform Streamlit</a>, which at the time had practically negligible revenue. </p><p>In the year and a half-ish since the launch of ChatGPT, <em>a lot</em> has changed. <a href="https://www.supervised.news/p/snowflakes-new-leader-is-inheriting">Snowflake has a new CEO, Sridhar Ramaswamy</a>, <a href="https://www.supervised.news/p/what-neevas-quiet-exit-tells-us-about">who arrived through its $150 million acquisition of AI search startup Neeva</a>. It&#8217;s announced a series of plays in AI with the <a href="https://investors.snowflake.com/news/news-details/2023/Snowflake-Puts-Industry-Leading-Large-Language-and-AI-Models-in-the-Hands-of-All-Users-with-Snowflake-Cortex/default.aspx">launch of Snowflake Cortex last year</a>. And this week at its annual summit, its announcements largely sat in AI&#8217;s orbit: a new catalog tool (Polaris), a new observability suite (Snowflake Trail), a notebook tool, and many others, all with a heavy emphasis on AI.</p><p>What was a database company under its past leaders is increasingly a company focusing its efforts on building the developer and operational frameworks for AI. It&#8217;s not exactly a pivot&#8212;Snowflake is still <em>the</em> data warehouse company&#8212;but it&#8217;s not&#8230; <em>not</em> a pivot.</p><p>People tell me that under Ramaswamy, &#8220;all hands on deck&#8221; for AI is an understatement at Snowflake. 
A major sense of urgency coming down from Ramaswamy&#8212;particularly around the risk of missing the AI wave&#8212;is powering a lot of the extreme &#8220;build or buy&#8221; mentality in AI.</p><p>&#8220;[Snowflake&#8217;s new leadership] very, very genuinely believes&#8212;and for that matter, correctly, in my opinion&#8212;that we are essentially in a once in a generation technological inflection,&#8221; Adrian Treuille, head of Streamlit, the developer platform Snowflake acquired for $800 million, told me. &#8220;Five years ago, most machine learning researchers thought we were decades away from talking to computers&#8230;. And now it&#8217;s like, ChatGPT, yeah, not a problem. It&#8217;s amazing. And so I would say that the thinking about the importance of AI is genuinely first principles.&#8221;</p><p>That&#8217;s materializing in a lot of forms, this week with the launch of a series of Iceberg- and AI-focused products. But Snowflake is also clearly in dealmaker mode, <a href="https://www.bloomberg.com/news/articles/2024-05-22/snowflake-talks-to-acquire-reka-ai-said-to-fizzle-with-no-deal">most recently evaluating an acquisition of Reka AI for a potential $1 billion</a>, per Bloomberg. A month before that deal petered out, Snowflake had launched its own large language model, Arctic, as well as a series of embeddings models. And while Databricks <a href="https://tabular.io/blog/tabular-is-joining-databricks/">was the ultimate buyer for Tabular, an Iceberg storage platform</a>, sources tell me Snowflake was aggressively in the mix for the 40-person startup going all the way back to April.</p><p>All this is pretty reasonable to expect from any company in the data abstraction layer, where sources effectively see three leaders: Snowflake, MongoDB, and Databricks. Enterprise data, which in many cases took <em>years</em> to get onto these platforms, is ripe for usage in modern AI. 
In fact, <em>most</em> of the value for AI is locked up in that data&#8212;either in the form of accessing it through retrieval augmented generation (RAG) or customizing the models directly with additional data.</p><p>While OpenAI&#8217;s fine-tuning APIs are incredibly straightforward to use, those same enterprises&#8212;especially ones with stricter privacy standards&#8212;aren&#8217;t likely to ship out their data to some API they don&#8217;t have direct control over. And that gives those data abstraction providers the ability to just add a customization or retrieval layer right on top of the data that enterprises have direct control over.</p><p>The difference here is that Snowflake, before the launch of ChatGPT and modern AI, was long considered a laggard in supporting machine learning. It didn&#8217;t support Python development until the launch of Snowpark in 2021. Databricks, meanwhile, had aggressively started in the opposite direction and then inched closer to Snowflake&#8217;s business over time.</p><p>But one of the biggest blockers to getting all these applications into production is having the suite of tooling around them&#8212;lineage, governance, development, observability, and to the extent possible, explainability. Snowflake, with its announcements this week, is bulldozing into all of those categories on its way to trying to be a backbone for AI development.</p><p>At its conference this week, Snowflake has effectively tried to brand itself as an &#8220;AI data cloud.&#8221; It&#8217;s a far cry from its summit in 2022 when one of its marquee announcements was Unistore, which was going after one of the holy grails of database technology in unifying analytical and transactional workloads.</p><p>Snowflake was long considered to be one of the key players in the abstraction layer, which has gone on to become a fundamental part of AI deployment. 
And with its launches it&#8217;s effectively growing into that role.</p><h2>Lakes, catalogs, and another small startup worth more than $1 billion</h2><p>Arguably the two most significant announcements from Snowflake this week were Polaris, a new open source catalog built on Iceberg, and Snowflake Trail, its new suite of observability tools. Catalogs effectively serve as the governance layer of a data store, providing access control and logging around data within an organization. And observability&#8212;basically the monitoring of a given product&#8217;s performance&#8212;is emerging as a key requirement for many larger enterprises in order to graduate out of that proof-of-concept stage for AI.</p><p>Modern AI broadly is built around the premise of data lakes as a backbone for the abstraction layer. You take <em>all</em> unstructured data&#8212;PDFs, images, emails, transcripts, whatever&#8212;pre-process it into something more coherent and accessible, and shove it into a semi-accessible format. That&#8217;s also led to the birth of a whole crop of startups, like unstructured.io, Datavolo, Reducto, and others, focused on piping that unstructured data into a data store.</p><p>Snowflake and Databricks have also embraced competing formats for data lakes. While Databricks has its own format (that&#8217;s now open source) in Delta Lake, Snowflake effectively bet on the open source format Iceberg. Its announcements to date have all relied on Iceberg, which has proved to be an extremely popular table format.</p><p>While ongoing for several years now, Snowflake&#8217;s embrace of data lakes is somewhat of a break from its long singular focus on data warehouses. Founded in 2012, Snowflake pounced on the decline in the cost of cloud storage to build a product that effectively remade the business intelligence and analytics layer and gave birth to a whole suite of tools that radically improved the quality of analytics. 
We generally refer to all these tools, including dbt, Alation, Monte Carlo, and others, as the modern data stack.</p><p>As machine learning emerged as a much-larger-than-anticipated market, with techniques and tools maturing on the cusp of the launch of ChatGPT, Snowflake pressed into the world of unstructured data&#8212;and data lakes. In late 2022, Snowflake <a href="https://www.snowflake.com/blog/iceberg-tables-powering-open-standards-with-snowflake-innovations/">launched Iceberg Tables</a>, embracing the open source format that came out of Netflix four years prior. (Databricks, at the time, was already seeing usage of large language models in the form of BERT.)</p><p>Machine learning tooling built on top of data lakes, though, had long been Databricks&#8217; sweet spot. Databricks announced the availability of Delta Lake in 2019, and in 2022 it <a href="https://www.databricks.com/blog/2022/06/30/open-sourcing-all-of-delta-lake.html">announced that its 2.0 version of Delta Lake would be open source</a>. And its 3.0 version, <a href="https://www.databricks.com/blog/announcing-delta-lake-30-new-universal-format-and-liquid-clustering">launched at last year&#8217;s summit</a>, was also effectively a shot at a universal format for data lake management.</p><p>Throughout all this, Databricks has espoused the paradigm of a data lakehouse&#8212;a unified approach to data warehousing and lake management that allows for execution of all tasks, whether that&#8217;s business intelligence or machine learning. 
And Tabular, a storage platform built on Iceberg, was one of the best-positioned tools to grow that paradigm.</p><p>The acquisition of Tabular, though, had been floated for <em>weeks</em>, going back even to April. The most recent number I had heard floated on the Snowflake side </p>
      <p>
          <a href="https://www.supervised.news/p/lakes-catalogs-and-snowflakes-full">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[A new whisper in the AI analytics room]]></title><description><![CDATA[Informatica, of all companies, is creeping back into conversations despite being a legacy vendor. Plus: what happens when RAG isn't enough?]]></description><link>https://www.supervised.news/p/a-new-whisper-in-the-ai-analytics</link><guid isPermaLink="false">https://www.supervised.news/p/a-new-whisper-in-the-ai-analytics</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Fri, 31 May 2024 21:56:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KhzM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KhzM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KhzM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png 424w, https://substackcdn.com/image/fetch/$s_!KhzM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png 848w, https://substackcdn.com/image/fetch/$s_!KhzM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KhzM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KhzM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png" width="1456" height="1105" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1105,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4035190,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KhzM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png 424w, https://substackcdn.com/image/fetch/$s_!KhzM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png 848w, https://substackcdn.com/image/fetch/$s_!KhzM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KhzM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b66d84e-e523-4f51-88f4-3486d9cec443_1824x1384.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">a number of small, sleek and happy robots are standing in a laboratory setting that looks like it is out of a Pixar movie, in the center of the lab is a large and much older steampunk robot staring down at the smaller sleek robots, the steampunk robot is waving jovially at the more sleek and advanced robots, --ar 4:3 &#8212; 
midjourney</figcaption></figure></div><p><em>We have both Snowflake and Databricks&#8217; annual summits coming in the next two weeks with a high volume of events going along with them. As part of that, the next few issues will be compressed with multiple topics (like this one) to account for the time sink they are going to cause.</em></p><div><hr></div><h2>Way too many coding models and a Bring Your Own Key paradigm on the horizon</h2><p>One <em>extremely</em> popular emerging tool among developers that consistently ends up in the same echelon as <a href="https://ollama.com/">Ollama</a> (in terms of how fervent their fans are) is <a href="https://cursor.sh/">Cursor</a>, an AI-powered developer tool that&#8217;s essentially a replacement for VSCode. </p><p>There are a lot of reasons Cursor is popular, as it <em>heavily</em> streamlines the coding experience with features like making a codebase searchable. But there&#8217;s another substantial difference when compared to VSCode: its model selection is more open-ended. </p><p>And going beyond the tool itself, Cursor offers a window into a future workflow for one of the most popular use cases for language models: bring your own key. Rather than relying on a GitHub Copilot integration and living in the Microsoft Extended Coding Universe, you could select the best model for your specific problem&#8212;or, maybe at some point, a powerful model run locally.</p><p>Cursor allows you to drop in both your OpenAI and Anthropic keys and swap between them during sessions. And, perhaps more importantly, you can override the OpenAI API base URL to point it at other compatible endpoints, <a href="https://api.together.xyz/">like ones from Together AI</a>. That means developers can pick the best/cheapest/most directly helpful models for their development environment. And it offers an opportunity for other endpoint providers to sneak directly into development environments. 
</p><p>And while GPT-4o, Gemini 1.5, and other top APIs will likely capture the majority of code development usage, Cursor gives open-ish source model providers (<em>as in Meta and Mistral</em>) ways to sneak into developer environments. Meanwhile, the coding model ecosystem continues to grow: Meta <a href="https://ai.meta.com/blog/code-llama-large-language-model-coding/">released CodeLlama last year</a>, and Mistral this week released its own coding model, <a href="https://mistral.ai/news/codestral/">Codestral</a>. </p><p>And pricing for all these endpoints is, of course, a race to the bottom that&#8217;s only accelerating. Let&#8217;s revisit the pricing chart for the <em>nth</em> time here!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gUQN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gUQN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 424w, https://substackcdn.com/image/fetch/$s_!gUQN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 848w, https://substackcdn.com/image/fetch/$s_!gUQN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!gUQN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gUQN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp" width="478" height="741.2939560439561" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2258,&quot;width&quot;:1456,&quot;resizeWidth&quot;:478,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gUQN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 424w, https://substackcdn.com/image/fetch/$s_!gUQN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 848w, https://substackcdn.com/image/fetch/$s_!gUQN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!gUQN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The process within Cursor itself has a few hurdles to set up&#8212;you have to have access to an endpoint with an API key in the first place, so it&#8217;s not a &#8220;click it and get calls&#8221; product like GitHub Copilot (<a href="https://cursor.sh/pricing">though there is a pro option with Cursor tied to GPT-4</a>). 
It also requires managing a variety of different keys, which could involve jumping between developer accounts.</p><p>Right now, Anthropic and OpenAI provide two of the best APIs. But a number of others are quickly catching up based on the <a href="https://chat.lmsys.org/">LMSys chatbot arena</a>. That includes Llama 3 variations, but it also includes potentially specialized ones like Cohere&#8217;s Command-R+. </p><p>Cursor effectively offers <em>another</em> avenue for leveling the playing field here. It probably isn&#8217;t the <em>last</em> new development tool to exploit that kind of bring-your-own-key approach. </p><p>But it also feels like an early sign that another of OpenAI&#8217;s biggest advantages&#8212;its deep integration with GitHub Copilot&#8212;is showing vulnerability.</p><h2>Informatica gets a second look</h2><p>Earlier this year, <em>The Wall Street Journal</em> reported that Informatica, a legacy vendor for data management, <a href="https://www.wsj.com/tech/salesforce-in-advanced-talks-to-buy-informatica-ba9ec09c">was in advanced talks to be acquired by Salesforce.</a> The <em>Journal</em> reported <a href="https://www.wsj.com/business/deals/salesforces-talks-to-buy-informatica-fizzle-4f6f111a">a little more than a week later</a> that those talks had broken down after the two sides were unable to agree on terms.</p><p>While that was a very, uh, <em>loud</em> story for Informatica, the reality is the company has come up more and more lately in the context of developments in AI. In particular, Informatica has found a way to tell a story around its potential AI services. At its conference this month it unveiled a suite of updates and integrations&#8212;<a href="https://finance.yahoo.com/news/informatica-unveils-blueprint-enterprise-generative-130000931.html">particularly with Snowflake&#8217;s AI platform, Cortex. 
</a></p><p>Informatica is one of <em>many</em> examples of what feels like a growing trend as AI tooling emerges: legacy vendors getting a second shot at reclaiming ground lost to challenger startups and cloud-native vendors. In the case of Informatica, that&#8217;s the modern data stack, which has picked off nearly every part of its business and turned it into independent products.</p><div><hr></div><h6><strong>Sponsored</strong></h6><h4><strong>This week&#8217;s newsletter is brought to you by Felicis</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="http://felicis.com" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kU-G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 424w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 848w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 1272w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kU-G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png" width="1456" height="366" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:366,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381012,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;http://felicis.com&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!kU-G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 424w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 848w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 1272w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Felicis has been investing in an impressive variety of <a href="https://www.felicis.com/sectors/ai">AI</a> and <a href="https://www.felicis.com/sectors/infra">infrastructure</a> companies, including: Runway, Weights &amp; Biases, MotherDuck, Supabase, Metaplane, Vannevar Labs, poolside, and Flower Labs. A generalist firm started by Google&#8217;s first product manager, Felicis is known for supporting founders with their <a href="https://www.felicis.com/founder-pledge">Founder&#8217;s Pledge</a>.</p><p><em><a href="https://www.felicis.com/founders-on-felicis">Read what founders say about working with Felicis.</a></em></p><div><hr></div>
      <p>
          <a href="https://www.supervised.news/p/a-new-whisper-in-the-ai-analytics">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[OpenAI's Silicon Valley pivot]]></title><description><![CDATA[What was a research organization is now following dreams of being the next Apple. Plus: Why is everyone suddenly talking about WebGPU again?]]></description><link>https://www.supervised.news/p/openais-silicon-valley-pivot</link><guid isPermaLink="false">https://www.supervised.news/p/openais-silicon-valley-pivot</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Wed, 22 May 2024 22:02:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!p_51!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p_51!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p_51!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp 424w, https://substackcdn.com/image/fetch/$s_!p_51!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp 848w, https://substackcdn.com/image/fetch/$s_!p_51!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!p_51!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p_51!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp" width="1232" height="928" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:928,&quot;width&quot;:1232,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:165290,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p_51!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp 424w, https://substackcdn.com/image/fetch/$s_!p_51!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp 848w, https://substackcdn.com/image/fetch/$s_!p_51!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!p_51!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0308a893-995c-4d58-9591-472e4ad874b8_1232x928.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">a startup bro wearing a black turtleneck, jeans, and tennis shoes presenting on stage like an Apple keynote with a remote in its hand and a laptop on a pedestal. 
On the very large screen on stage is a small logo that is spiraled and black and white, sunday funny comics aesthetic, --ar 4:3 &#8212; midjourney</figcaption></figure></div><p><em>Since we have a holiday weekend coming up and I&#8217;m still on the other end of a heavy travel week (as well as getting ready for a heavy conference month in June), two columns this week are kind of smashed into one with some additional notes on WebGPU and WebLLM. Enjoy the long weekend!</em></p><div><hr></div><p>In January 2007, Silicon Valley luminary Steve Jobs took the stage to explain a device that was &#8220;an iPod, a phone, and an internet communicator.&#8221; He followed it, to a roaring crowd, by saying, &#8220;are you getting it yet?&#8221; </p><p>Seventeen years later, a handful of employees sat in a room in Silicon Valley speaking with a cheerful synthetic assistant that sounded close enough to Scarlett Johansson&#8217;s voice in the film <em>Her</em> that it might as well have been Johansson. And it was living in the same device: an iPod, a phone, and an internet communicator. </p><p>But this time around there&#8217;s nothing to &#8220;get,&#8221; as the most iconic Silicon Valley companies have already made casually pulling products out of science fiction&#8212;self-driving cars, buying anything with a press of a button, speaking to friends across the world on demand, and access to nearly the entire corpus of human knowledge&#8212;into reality almost a rite of passage. </p><p>It seems increasingly obvious that OpenAI, and its leader Sam Altman, want to become part of enduring Silicon Valley lore with a truly Silicon Valley product: a feat of engineering that&#8217;s, quite literally, straight out of a movie. And in classic Silicon Valley fashion, there&#8217;s simply the expectation of a number of variable-sized bumps along the way. 
Or, as we used to say, &#8220;move fast and break things.&#8221; </p><p>All this, though, is more or less the logical conclusion of the last six months of OpenAI&#8217;s operations. Altman, <a href="https://www.supervised.news/p/with-sam-altman-out-where-does-openai">ousted by the board in November last year</a>, engineered a return that presaged a pivot from a research firm into an achieve-the-impossible consumer technology company.</p><p>OpenAI has effectively shed its moniker as a research institution in recent weeks, not only through its unveiling of that synthetic assistant but by seemingly moving to an ask-for-forgiveness-not-permission approach to development. Its <a href="https://x.com/ilyasut/status/1790517455628198322">chief scientist Ilya Sutskever left</a>, and Wired reported that <a href="https://www.wired.com/story/openai-superalignment-team-disbanded/">its Superalignment team has disbanded</a> as well. And this week it is dealing with the <a href="https://www.nbcnews.com/tech/tech-news/scarlett-johansson-shocked-angered-openai-voice-rcna153180">fallout of that same giggling digital persona sounding way too close to Scarlett Johansson</a>, who declined to participate in developing its &#8220;Sky&#8221; voice, <a href="https://www.nbcnews.com/tech/tech-news/scarlett-johansson-shocked-angered-openai-voice-rcna153180">per a statement reported by NBC News</a>.</p><p>OpenAI is now building the kind of culture-altering company Silicon Valley entrepreneurs dream of&#8212;through whatever means are needed to do so. It&#8217;s why so many people come to San Francisco in the first place, and with Altman at the helm, OpenAI is becoming a quintessential product-led Silicon Valley company.</p><p>Altman has long had deep ties to Silicon Valley, the San Francisco Bay Area, and San Francisco itself, dating back to the early mobile tech boom. 
His growing influence in Silicon Valley, before taking over OpenAI, culminated in becoming the president of the industry&#8217;s most prestigious accelerator, Y Combinator. You need <a href="https://www.supervised.news/p/is-openai-about-to-replace-a-bunch">look no further than the large expat contingent of Stripe</a>&#8212;one of Y Combinator&#8217;s most successful companies&#8212;within OpenAI. Airbnb CEO <a href="https://www.theinformation.com/articles/airbnb-ceo-chesky-emerges-as-altman-ally-adviser?rc=u5devd">Brian Chesky, another of Y Combinator&#8217;s greatest success stories, has emerged as one of Altman&#8217;s strongest allies</a>, which you can see in much more detail in an <a href="https://www.theinformation.com/articles/airbnb-ceo-chesky-emerges-as-altman-ally-adviser?rc=u5devd">in-depth report in The Information</a>.</p><p>Altman quipped that OpenAI&#8217;s new voice assistant was straight out of &#8220;<em>Her</em>,&#8221; a film that received a smorgasbord of Academy Award nominations&#8212;including a win for Best Original Screenplay in a year when it was competing with films like <em>American Hustle, 12 Years a Slave</em> (the Best Picture winner that year), and the much-nearer-future <em>Gravity. Her</em>, in addition to being a great film, offered a glimpse at a future with truly humanlike, empathetic synthetic assistants. </p><p>Ever since the film came out, Silicon Valley has desperately tried to make that a reality. But voice assistants like Alexa, Cortana, and Siri have long been a dumpster fire, and the pathway to that product simply didn&#8217;t seem to exist. It was <em>magic</em> in a future so far away that we couldn&#8217;t even see a timeline beyond the limited use cases something like Google Assistant could offer.</p><p>OpenAI&#8217;s unveiling of its latest advanced model, GPT-4o, might as well have been a footnote. 
Whether or not its voice assistant ends up becoming the ambient companion that tech has long dreamed of, OpenAI has also shown that it&#8217;s no longer just a research firm creating an API that can power the next generation of applications. </p><p>Instead, OpenAI has pivoted to a company chasing the culture-altering step-change in technology that entrepreneurs have long dreamed about. And it has done so not once but, if you include DALL-E 2, potentially <em>three times</em>&#8212;a feat rarely achieved by even the most successful companies in Silicon Valley.</p><h2>Move fast and break things</h2><p>Regardless of what&#8217;s happened in recent weeks, OpenAI&#8217;s voice assistant unveiling was a triumph. The company, hardly a startup anymore, needed to create some kind of step-change cultural moment in AI at a time when the vibes were clearly of the &#8220;AI is hitting a wall&#8221; kind. OpenAI also wasn&#8217;t doing itself any favors by simultaneously running one of the most expensive and least performant APIs (on standard benchmarks) for light-to-medium-difficulty problems in GPT-3.5 Turbo.</p><p>The &#8220;vibe&#8221; of its demo was more akin to the release of the iPhone, the iPad, or the <a href="https://developers.google.com/search/blog/2010/09/google-instant-impact-on-search-queries">original release of Google&#8217;s we-know-what-you&#8217;re-looking-for Instant Search product</a>.</p><p>In its last few weeks, OpenAI has effectively looked like a company run by a classic Silicon Valley executive. At the time of Altman&#8217;s ouster, that represented an interesting potential turning point for an industry largely flooded with academics. 
It&#8217;s no surprise that Microsoft, OpenAI&#8217;s largest partner, <a href="https://www.wsj.com/tech/ai/altman-firing-openai-520a3a8c">would work as hard as it could to keep an executive like Altman in place</a>.</p><p>In April 2009, Y Combinator founder Paul Graham <a href="https://paulgraham.com/5founders.html">penned an essay</a> about five of the most interesting founders of the last 30 years, a list that at the time included Altman alongside Silicon Valley legends Steve Jobs, Larry Page, and Sergey Brin. Here&#8217;s part of what he said in the essay:</p><blockquote><p>What I learned from meeting Sama is that the doctrine of the elect applies to startups. It applies way less than most people think: startup investing does not consist of trying to pick winners the way you might in a horse race. But there are a few people with such force of will that they're going to get whatever they want.</p></blockquote><p>That &#8220;force of will&#8221; is the <em>north star</em> of entrepreneurship in Silicon Valley. You don&#8217;t win on ideas, you win on execution. And the founders listed alongside Altman (and Mark Zuckerberg, if we want to include him) routinely skirted the absolute boundary of what was acceptable at the time&#8212;often crossing it just barely before finding a way to avoid the harshest penalties.</p><ul><li><p>In Jobs&#8217; mission to build an iconic phone, the iPhone 4 brought about the infamous &#8220;<a href="https://www.wsj.com/articles/SB10001424052748704913304575371131458273498">antennagate</a>&#8221; scandal, wherein holding the phone at the wrong spots would make the signal die out. 
Apple&#8217;s mea culpa here was, well, free iPhone 4 cases for everyone.</p></li><li><p>While Facebook was routinely the subject of privacy imbroglios, it&#8217;s easy to forget that Meta was <a href="https://techcrunch.com/2012/08/10/facebook-ftc-settlement-12/">under a </a><em><a href="https://techcrunch.com/2012/08/10/facebook-ftc-settlement-12/">twenty-year consent decree</a></em><a href="https://techcrunch.com/2012/08/10/facebook-ftc-settlement-12/"> that the FTC hit it with back in 2012</a>. The FTC found in 2019 that it <a href="https://www.reuters.com/article/idUSKCN1UJ1L9/">had violated that consent decree and slapped it with a $5 billion fine</a>.</p></li><li><p>Meta <em>also</em> <a href="https://www.theatlantic.com/technology/archive/2014/06/everything-we-know-about-facebooks-secret-mood-manipulation-experiment/373648/">came under fire for an experiment that intentionally seeded Facebook&#8217;s News Feed</a> with content to, essentially, see if it altered the mood of its users.</p></li><li><p>Google <em>also</em> <a href="https://techcrunch.com/2012/08/09/google-settles-with-ftc-agrees-to-pay-22-5m-penalty-for-bypassing-safari-privacy-settings/">paid the FTC a fine in 2012 for bypassing Safari&#8217;s privacy settings</a> (for a comically low $22.5 million). Google also paid a $170 million fine <a href="https://www.ftc.gov/news-events/news/press-releases/2019/09/google-youtube-will-pay-record-170-million-alleged-violations-childrens-privacy-law">for violating another privacy law</a> in 2019.</p></li></ul><p>You could walk through the history of most of these iconic tech companies and find these kinds of skirmishes with the norm. That behavior and method of operating distilled down to an early motto at Facebook: move fast and break things. 
And in AI, one of the strongest &#8220;norms&#8221; is a heavy focus on safety for a technology with <a href="https://www.bloomberg.com/news/newsletters/2024-02-07/how-investigators-solved-the-biden-deepfake-robocall-mystery">an extreme potential for causing chaos</a>, even if the results are a little mixed.</p><p>Altman, born and raised in the classic Silicon Valley mold, is essentially just continuing a long tradition of testing the boundaries as hard as physically possible. You could point to any one of OpenAI&#8217;s ongoing disputes: its engagement with <a href="https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html">The New York Times over copyright issues</a>, its sidelining of the Superalignment team, or its effectively remaking an iconic sci-fi character <a href="https://www.reuters.com/technology/scarlett-johansson-says-openai-chatbot-voice-eerily-similar-hers-2024-05-21/">without the direct involvement of the actress behind her (it?)</a>. </p><p>It&#8217;s not the only voice-cloning scandal we&#8217;ve had, though it is certainly the highest-profile one, as it was effectively part of the brand behind its voice assistant (with Altman literally tweeting the reference to Her). The stakes were, and are, high enough&#8212;and not limited to Johansson&#8212;that the issue made it into the SAG-AFTRA collective bargaining discussions in the industry.</p><p>And Y Combinator, practically an institution in Silicon Valley, is built to attract founders and teams that are ready to pull out all the stops to build a successful startup. 
With one of its former leaders at the helm, we could only expect these myriad disputes to be just an early signal of what&#8217;s to come as OpenAI looks to remake cultural norms in tech and beyond.</p><h2>The fine line between enterprise, consumer, and icon</h2><p>But despite the release of ChatGPT, which created one of those cultural <em>moments</em> in tech, OpenAI was largely what you&#8217;d consider an &#8220;enterprise&#8221; business not unlike Stripe. It offered a suite of APIs that would potentially power a next generation of technology products, though for the time being that technology is still looking for its killer use case. But most of the experimentation here is, well, enterprise use cases (<a href="https://www.supervised.news/p/ai-in-april-and-q2-rpa-in-focus-holistic">such as RPA</a>).</p><p>The release of Llama 3, which on many standard benchmarks was highly competitive with, or more effective than, GPT-3.5 Turbo, exposed vulnerabilities in OpenAI&#8217;s core enterprise business. <a href="https://www.supervised.news/p/the-point-of-endpoints">A flurry of startups threw up endpoints</a> that could effectively replace GPT-3.5 Turbo with a few lines of code. 
OpenAI did what it could&#8212;<a href="https://www.supervised.news/p/openai-sort-of-covers-one-of-its">a price cut to GPT-3.5 Turbo and the release of a batch processing API</a>&#8212;but its first-mover and convenience-factor advantage was quickly starting to degrade.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gUQN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gUQN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 424w, https://substackcdn.com/image/fetch/$s_!gUQN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 848w, https://substackcdn.com/image/fetch/$s_!gUQN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 1272w, https://substackcdn.com/image/fetch/$s_!gUQN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gUQN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp" width="576" height="893.2747252747253" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2258,&quot;width&quot;:1456,&quot;resizeWidth&quot;:576,&quot;bytes&quot;:77038,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gUQN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 424w, https://substackcdn.com/image/fetch/$s_!gUQN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 848w, https://substackcdn.com/image/fetch/$s_!gUQN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 1272w, https://substackcdn.com/image/fetch/$s_!gUQN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa52ce17e-f041-4b93-b895-c8fd43f889ef_1456x2258.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Rather than getting a GPT-4 &#8220;Lite&#8221; model to address that problem, we got a multi-modal model roughly equivalent to GPT-4 at half the price of GPT-4 Turbo. And while that&#8217;s an enormous price cut compared to GPT-4 Turbo (and GPT-4 Classic), it still doesn&#8217;t tackle the workhorse model problem OpenAI now faces. Instead, GPT-4o is more in line with Claude Sonnet and Opus, Cohere&#8217;s Command-R+ (which developers seem to <em>love</em>), and Mistral&#8217;s latest higher-tier mixture-of-experts and mistral-large models.</p><p>When we step back and look at the pricing in that light&#8212;and what OpenAI is trying to do here&#8212;it starts to make a little more sense. GPT-4o&#8217;s goals are more aligned with what Gemini 1.5 Pro, Mistral&#8217;s large model, Claude Opus, and to a certain extent Command-R+ are trying to achieve: a premium tier. Here&#8217;s how the pricing breaks down. 
(I&#8217;m not going to include GPT-4 Turbo, GPT-4 Classic, or Claude Opus, as they basically break the chart and slightly miss the point.)</p>
      <p>
          <a href="https://www.supervised.news/p/openais-silicon-valley-pivot">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[AI in April (and Q2): RPA in focus, holistic evaluations, and eyes back on Datadog]]></title><description><![CDATA[Plus: OpenAI and Google are doing some stuff next week.]]></description><link>https://www.supervised.news/p/ai-in-april-and-q2-rpa-in-focus-holistic</link><guid isPermaLink="false">https://www.supervised.news/p/ai-in-april-and-q2-rpa-in-focus-holistic</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Fri, 10 May 2024 22:54:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TPON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00a4583-306b-4f59-8016-92f092190eab_1232x928.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TPON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00a4583-306b-4f59-8016-92f092190eab_1232x928.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TPON!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00a4583-306b-4f59-8016-92f092190eab_1232x928.png 424w, https://substackcdn.com/image/fetch/$s_!TPON!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00a4583-306b-4f59-8016-92f092190eab_1232x928.png 848w, https://substackcdn.com/image/fetch/$s_!TPON!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00a4583-306b-4f59-8016-92f092190eab_1232x928.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TPON!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00a4583-306b-4f59-8016-92f092190eab_1232x928.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TPON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00a4583-306b-4f59-8016-92f092190eab_1232x928.png" width="1232" height="928" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b00a4583-306b-4f59-8016-92f092190eab_1232x928.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:928,&quot;width&quot;:1232,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1475355,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TPON!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00a4583-306b-4f59-8016-92f092190eab_1232x928.png 424w, https://substackcdn.com/image/fetch/$s_!TPON!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00a4583-306b-4f59-8016-92f092190eab_1232x928.png 848w, https://substackcdn.com/image/fetch/$s_!TPON!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00a4583-306b-4f59-8016-92f092190eab_1232x928.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TPON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00a4583-306b-4f59-8016-92f092190eab_1232x928.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">a friendly-looking pixar-styled robot standing in front of a table that is covered in ants, the ants look like robots and are all carrying leaves, there are NO cables coming out of the robot, sunday funny comics aesthetic, --ar 4:3 &#8212; midjourney</figcaption></figure></div><p><em>This issue is going to be a bit compressed and cover 
multiple topics as I&#8217;m recovering from a <a href="https://twitter.com/mattlynley/status/1787906360824496544">hand injury</a> and a minor illness. I apologize in advance for any typos and whatnot.</em></p><div><hr></div><p>Next week Google and OpenAI are <a href="https://twitter.com/openai/status/1788987793613725786">both going to be making announcements</a> <a href="https://developers.googleblog.com/en/get-ready-for-google-io-program-lineup-revealed/">on their AI products</a>, with OpenAI sliding just ahead of Google with a livestream on Monday.</p><p>Sam Altman shut down questions about whether this is an announcement around a Perplexity competitor or GPT-5 <a href="https://twitter.com/sama/status/1788989777452408943">on Twitter</a>. But OpenAI says the announcements will be updates for ChatGPT and GPT-4 <em>right</em> before Google&#8217;s developer conference kicks off. And we&#8217;ll probably get <em><a href="https://www.supervised.news/p/is-openai-about-to-replace-a-bunch">another</a></em><a href="https://www.supervised.news/p/is-openai-about-to-replace-a-bunch"> whole new list of startups OpenAI and Google might-but-might-not-but-possibly smush in the process</a>.</p><p>OpenAI obviously has to contend with Google and its continued improvements, as well as Meta&#8217;s largest Llama 3 model (400B+), which is still in the works. <em>But</em> maybe we&#8217;ll finally see a response to all the pressure OpenAI has faced on its workhorse GPT-3.5 Turbo model, particularly with the release of the Llama 3 models. Llama 3 70B is both <a href="https://www.supervised.news/p/vibes-based-evals-and-weights-and">cheaper and more performant than GPT-3.5 Turbo</a> on standard benchmarks. 
Cost and simplicity have long been among OpenAI&#8217;s advantages, but that&#8217;s <a href="https://www.supervised.news/p/a-new-rival-startup-is-emerging-for">slipping away as more startups offer drop-in replacements for GPT-3.5 Turbo with an API powered by Llama 3</a>.</p><p>What that would look like is unclear, but it wouldn&#8217;t be all that surprising if OpenAI came out with some kind of lightweight GPT-4 or upgraded workhorse model alongside whatever else it has planned. (A hub called the <a href="https://chat.lmsys.org/">LMSys chatbot</a> arena, where users compare results from competing models and rate the better one, <a href="https://twitter.com/sama/status/1787222050589028528">has been exploding with speculation over a series of cheekily-named models</a> <a href="https://twitter.com/sama/status/1785107943664566556">that seemingly suggest they are from OpenAI</a>.)</p><p>Now, with that speculation out of the way, let&#8217;s get to the actually fun stuff for the quarter so far: what everyone&#8217;s arguing about in AI lately. (I&#8217;ll be recapping conversations here from April up til today, even though this is <em>technically</em> supposed to be focused on April.)</p><p>The whole <a href="https://www.supervised.news/p/lets-check-back-in-on-the-vector">retrieval augmented generation (or RAG) pipeline is coalescing into a rather sensible (if not sprawling) chain of startups</a> and tools. And, as usual, language models are getting closer to production&#8212;though the timelines still range from a few quarters to more than a year, depending on who you talk to.</p><p>From everyone I&#8217;ve spoken with lately, the key word this month (and the quarter) seems to be <em>practicality</em>. Not in the sense of whether a product built on a language model can or can&#8217;t be done&#8212;but whether it should be done in the first place. 
Experiments are giving way to analysis around cost, evaluation, and the simplest places to start to extract value.</p><p>So, with that said, here&#8217;s what everyone is talking about as we head further into the second quarter of the year:</p><ul><li><p><strong>RPA gets a closer look</strong>. Robotic process automation&#8212;basically the work of automatically clicking around on a website&#8212;is increasingly seen as a spot ripe for disruption by language models. In particular, RPA is seen as an initial stepping stone toward a much broader automation network.</p></li><li><p><strong>Long context versus RAG</strong>. There&#8217;s universal agreement that jamming more relevant information into a prompt yields better results. But now there&#8217;s this emerging <em>debate-ish</em> over whether gargantuan context windows or retrieval augmented generation is the way to go.</p></li><li><p><strong>Evaluations beyond just the benchmarks</strong>. While a lot of these larger models try to one-up each other on performance benchmarks, a lot of companies are now facing a struggle over how to evaluate <em>whether one did the thing they wanted to do</em>. And it turns out that&#8217;s as much an aesthetic review as it is a check on basic performance metrics.</p></li><li><p><strong>Eyes are back on Datadog for its next move in AI.</strong> Whispers are starting to trickle in around Datadog&#8217;s intent to enter AI pipelines more formally this year. 
Like MongoDB with vector search, investors and users are waiting to see what Datadog&#8212;which is already deeply entrenched in enterprises&#8212;does in AI observability and evaluation.</p></li></ul><p>Let&#8217;s dive into each, starting with RPA!</p><h2>RPA gets a closer look as a primitive agent</h2><p>One topic that routinely comes up is whether the infrastructure layer&#8212;startups like <a href="https://ollama.com/">Ollama</a>, <a href="https://www.langchain.com/">LangChain</a>, <a href="https://www.trychroma.com/">Chroma</a>, or <a href="https://elevenlabs.io/">ElevenLabs</a>&#8212;is saturated with investment and whether the focus should shift to the app layer on top of all that infrastructure. There are a lot of natural use cases for AI that everyone tends to point to, like talking to legal documents, code generation, customer service chatbots, and so on and so forth.</p><p>Increasingly, though, there&#8217;s another broader focal point emerging on that app layer from experts and investors: robotic process automation, or RPA. And while these automations&#8212;a fancy way of saying a bot acting like a human in some form to complete a simple task&#8212;are currently more discrete, we already have a term for what would come next with substantially more advanced self-orchestrated RPA: agents.</p><p>RPA is seen by a lot of companies and investors as a kind of low-hanging fruit that they can quickly capitalize on, either through the use of smaller models or just calling one of the workhorse endpoints like GPT-3.5 Turbo. 
Most companies <a href="https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/">reflexively point to the case study put out by Klarna</a> as an example of what can be accomplished with a language model under the hood just routing requests to specific products, FAQs, or support pages automatically.</p><p>Modern RPA has largely been relegated to manual tasks bots can accomplish autonomously, such as, well, clicking around on a website to figure out if stuff works or how to break an app. The challenge with all that &#8220;clicking around&#8221; is that minor changes in workflows and products can eject said bot out of a designated pathway and render it useless (or outright counterproductive).</p><p>This is where language models can potentially plug in. The extreme version is that all app interactions are through natural language, which you would&#8230; automate with a language model. But in the meantime, language models can potentially filter through customer support tickets for escalation and de-escalation, or fiddle around with websites to find ways to break them.</p><p>UiPath is largely the company with a target on its back here. UiPath <em>did</em> <a href="https://www.uipath.com/newsroom/uipath-unveils-new-family-of-llms-at-ai-summit">announce a family of language models in March this year</a>. Founded in 2005, UiPath effectively created a business valued at more than $10 billion. Meanwhile, Microsoft, purveyor of Copilot buttons, has&#8212;shocker&#8212;<a href="https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave1/power-automate/copilot-power-automate">a copilot for its RPA product Power Automate</a>.</p>
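<p>To make the ticket-routing idea above concrete, here is a hypothetical sketch of the Klarna-style pattern: a model labels each ticket and a fixed routing table decides where it goes. Everything here is illustrative (the queue names are made up, and the <code>classify</code> stub stands in for a real chat-completion call to a workhorse model); it is not any vendor&#8217;s actual implementation.</p>

```python
# Hypothetical sketch of LLM-based ticket triage. A real system would send
# PROMPT.format(ticket=...) to a workhorse model endpoint; here `classify`
# is a deterministic stub so the routing logic is visible on its own.
ROUTES = {
    "refund": "billing-queue",
    "bug": "support-queue",
    "other": "human-escalation",  # anything uncertain goes to a person
}

PROMPT = (
    "Classify this customer ticket as one of: refund, bug, other.\n"
    "Ticket: {ticket}\nLabel:"
)

def classify(ticket: str) -> str:
    """Stand-in for the model call; returns one of the expected labels."""
    text = ticket.lower()
    if "charge" in text or "refund" in text:
        return "refund"
    if "error" in text or "crash" in text:
        return "bug"
    return "other"

def route(ticket: str) -> str:
    # The routing table, not the model, bounds what the bot may do on its
    # own: an unrecognized label falls through to a human, never a queue.
    return ROUTES.get(classify(ticket), "human-escalation")
```

<p>The design point this is meant to show: the language model only picks a label, while a small, auditable table decides the action&#8212;which is what keeps &#8220;clicking around&#8221; bots from wandering off their designated pathway.</p>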
      <p>
          <a href="https://www.supervised.news/p/ai-in-april-and-q2-rpa-in-focus-holistic">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Where graph databases live in a future AI data stack]]></title><description><![CDATA[Another once specialized database technology from the previous big data era might find its way into AI products thanks to retrieval augmented generation.]]></description><link>https://www.supervised.news/p/where-graph-databases-live-in-a-future</link><guid isPermaLink="false">https://www.supervised.news/p/where-graph-databases-live-in-a-future</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Wed, 01 May 2024 23:09:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QyWe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QyWe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QyWe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp 424w, https://substackcdn.com/image/fetch/$s_!QyWe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp 848w, https://substackcdn.com/image/fetch/$s_!QyWe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!QyWe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QyWe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp" width="1232" height="928" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:928,&quot;width&quot;:1232,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:308568,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QyWe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp 424w, https://substackcdn.com/image/fetch/$s_!QyWe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp 848w, https://substackcdn.com/image/fetch/$s_!QyWe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!QyWe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9acd20-4c2f-4c66-89aa-d96cac91ec32_1232x928.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">a small friendly pixar-styled robot looking up at a large chart, the chart is connected with multiple dots that stretch out across an entire large wall, sunday funny comics aesthetic, --ar 4:3 &#8212; midjourney</figcaption></figure></div><p>If 2024 is supposed to be the year of &#8220;AI in production,&#8221; the first step to getting 
there is throwing as many ideas as possible at the problem and hoping for the best on the other side.</p><p>We already saw that with the emergence of retrieval augmented generation, or RAG, which at first seemed like a bandaid but now seems like a permanent fixture. But even those new permanent fixtures have limitations, and there&#8217;s still a bar of quality (and practicality) for all these newer model-powered products to cross.</p><p>Now, another big data technology may be getting a closer look as these kinds of applications move closer to production and demand higher-quality or more accurate responses: graph databases.</p><p>More specifically, some companies are exploring adding graph search to a RAG setup to improve the relevancy of the results. Like many other once highly specialized technologies, such as vector databases, graph databases potentially offer another layer of relevancy to deliver the right information to a prompt and improve the quality (or accuracy) of the results.</p><p>&#8220;You can think of graph independently and vector independently as one way to store features&#8212;graph captures the connections much better than key-value, and vectors do the semantic capturing much better in a more condensed format,&#8221; Naren Narenden, chief scientist at <a href="https://aerospike.com/">Aerospike</a>, a provider of graph and vector databases, told me. &#8220;Independently they work well as new kinds of feature stores, but you can see how they can come together. You can have a RAG application that starts out with semantic search that identifies a few key nodes as a first step, and use the graph to gather the rest of the information.&#8221;</p><p>The idea is pretty straightforward: instead of using just a graph database, or just a vector database, why not both? 
</p><p>Developers use the same kind of classic RAG approach to retrieve information, but in this case, it&#8217;s retrieving nodes&#8212;a given point with a lot of information connected to other points&#8212;and then running a much more intelligent graph search. </p><p>Graph databases and graph search were already in use in cases where one entity&#8212;say, an e-commerce customer or financial transaction&#8212;has a lot of attributes and is intricately linked to many other concepts. You could see a new transaction show up (which would include location, card number, vendor, amount, and so on) and quickly chase down a lot of information about it to determine if it is an anomaly and should be flagged as fraud. </p><p>That includes a process called traversal, where you search for information by literally moving from node to node to find what you&#8217;re looking for. And that&#8217;s largely limited based on where you <em>start</em> <em>on the graph in the first place</em>. But taking a piece of information coming in and using the classic RAG playbook&#8212;embedding it, searching for similar results, and spitting out relevant nodes&#8212;gives you more ideas of <em>where</em> to start on the graph. </p><p>And it turns out it potentially helps alleviate two of the major problems each type has: you have more ideas of where to start searching on a graph; and you get a better understanding of relationships and connections than you might get in a vector database.</p><p>&#8220;Vector spaces are completely opaque&#8212;it&#8217;s a bunch of numbers, and human beings can&#8217;t parse that,&#8221; Emil Eifrem, CEO of graph database provider <a href="https://neo4j.com/">Neo4j</a>, told me. &#8220;Graph spaces meanwhile are completely explicit. I show an apple and a tennis ball, and vector similarity search will say they&#8217;re similar but not tell us why. It&#8217;s some dimensions in some kind of latent space out there. 
In graph space, though, we see an apple and an orange are related explicitly because they&#8217;re both fruit.&#8221;</p><p>For now, it&#8217;s still a two-step process for those providers, and it doesn&#8217;t show up all that often. But as companies move from proof-of-concept to actual production and understand the real performance needs of these apps, anything and everything is up for discussion. And that includes determining whether graph databases fit into a more modern AI data stack.</p><h2>The return of graph databases</h2><p>Perhaps the most well-known startup in graph databases historically is Neo4j, which was founded in 2007. <a href="https://neo4j.com/press-releases/neo4j-announces-seriesf-funding/">Neo4j last raised $325 million at a valuation over $2 billion in June 2021</a>, and <a href="https://neo4j.com/press-releases/2021-company-momentum/">said in January that it had passed $100 million in annual recurring revenue</a>. Neo4j is backed by GV, Greenbridge Venture Partners, Eurazeo, One Peak, and several others. (And ironically, like many older big data startups, Neo4j had fallen so far off the radar of most people I talk to that pretty much everyone was curious what they&#8217;re up to.)</p><p>Graph databases have typically found a place in financial services, particularly around fraud prevention, as well as recommendation engines. Each person or transaction has some kind of unique information that is intricately linked to other pieces of information. In the case of a recommendation engine, it could be a user&#8217;s prior purchasing or search behavior, which can change in near real time.</p><p>Vector databases, meanwhile, are well suited for that flood of unstructured information. Lots of different kinds of data sources&#8212;from PDFs to Slack messages or Confluence pages&#8212;can all get dumped into the same place for retrieval. 
Graph databases, meanwhile, thrive on understanding the relationships within that data, and as a result need something that falls in the middle.</p><p>Graph databases have certainly not gone away, though they have been traditionally thrown into the bucket of &#8220;very important for some use cases.&#8221; And that&#8217;s still going to be the case going forward, as the technical barrier for implementing this dual setup is pretty high. Still, this month <a href="https://aws.amazon.com/blogs/database/gql-the-iso-standard-for-graphs-has-arrived/">the International Organization for Standardization</a> actually published GQL, a sibling language to SQL designed for graph databases.</p><div><hr></div><h6><strong>Sponsored</strong></h6><h4><strong>This week&#8217;s newsletter is brought to you by Felicis</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="http://felicis.com" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kU-G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 424w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 848w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 1272w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!kU-G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png" width="1456" height="366" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:366,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381012,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;http://felicis.com&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kU-G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 424w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 848w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 1272w, https://substackcdn.com/image/fetch/$s_!kU-G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd84f0de6-5031-449f-baf5-ed5321f7f768_1822x458.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Felicis has been investing in an impressive variety of <a href="https://www.felicis.com/sectors/ai">AI</a> and <a href="https://www.felicis.com/sectors/infra">infrastructure</a> companies, including: Runway, Weights &amp; Biases, MotherDuck, Supabase, Metaplane, Vannevar Labs, poolside, and Flower Labs. A generalist firm started by Google&#8217;s first product manager, Felicis is known for supporting founders with their <a href="https://www.felicis.com/founder-pledge">Founder&#8217;s Pledge</a>.</p><p><em><a href="https://www.felicis.com/founders-on-felicis">Read what founders say about working with Felicis.</a></em></p><div><hr></div>
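The two-step retrieval described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any vendor's API: the graph, the hand-written two-dimensional "embeddings," and the function names are all hypothetical, with networkx standing in for a graph database and real embedding models left out entirely.

```python
# Toy sketch of vector-seeded graph retrieval: a similarity search
# picks entry nodes, then a bounded traversal gathers connected context.
import numpy as np
import networkx as nx

# Hypothetical knowledge graph: nodes carry text, edges carry relationships.
G = nx.Graph()
G.add_node("apple", text="Apples are a pome fruit.")
G.add_node("orange", text="Oranges are a citrus fruit.")
G.add_node("fruit", text="Fruit is the edible product of a plant.")
G.add_edge("apple", "fruit", rel="is_a")
G.add_edge("orange", "fruit", rel="is_a")

# Hand-written stand-ins for embeddings (a real system would call a model).
embeddings = {
    "apple": np.array([1.0, 0.1]),
    "orange": np.array([0.9, 0.2]),
    "fruit": np.array([0.5, 0.5]),
}

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def graph_rag_retrieve(query_vec, k=1, hops=1):
    # Step 1: semantic search picks the top-k entry nodes.
    seeds = sorted(embeddings, key=lambda n: cosine(query_vec, embeddings[n]),
                   reverse=True)[:k]
    # Step 2: traverse the graph around each seed to gather the rest.
    context = set()
    for seed in seeds:
        context.add(seed)
        context.update(nx.single_source_shortest_path_length(G, seed, cutoff=hops))
    return [G.nodes[n]["text"] for n in sorted(context)]

# A query vector close to "apple" pulls in the explicitly linked "fruit" node.
print(graph_rag_retrieve(np.array([1.0, 0.0]), k=1, hops=1))
```

A production setup would swap these toys for an embedding model and a graph query language such as Cypher or GQL, but the shape stays the same: semantic search picks the starting points, and traversal gathers the connected context.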
      <p>
          <a href="https://www.supervised.news/p/where-graph-databases-live-in-a-future">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Vibes-based evals and Weights & Biases' second act]]></title><description><![CDATA[Plus, a Llama 3 price war is brewing, and what Snowflake's new model means for pre-training.]]></description><link>https://www.supervised.news/p/vibes-based-evals-and-weights-and</link><guid isPermaLink="false">https://www.supervised.news/p/vibes-based-evals-and-weights-and</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Thu, 25 Apr 2024 20:02:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!C7Cy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C7Cy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C7Cy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp 424w, https://substackcdn.com/image/fetch/$s_!C7Cy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp 848w, https://substackcdn.com/image/fetch/$s_!C7Cy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!C7Cy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C7Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp" width="1232" height="928" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:928,&quot;width&quot;:1232,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:387874,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C7Cy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp 424w, https://substackcdn.com/image/fetch/$s_!C7Cy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp 848w, https://substackcdn.com/image/fetch/$s_!C7Cy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!C7Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80abd0d4-022a-4b62-ad06-c6df57ddf864_1232x928.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">a tiny friendly robot holding a small hammer standing in front of a giant computer screen, the screen takes up an entire wall that dwarfs the robot, the screen has multiple graphs, sunday funny comics aesthetic, --ar 4:3 &#8212; midjourney</figcaption></figure></div><p><em>Author&#8217;s note: covering a few things today starting with 
some notes on Llama 3 for free subscribers, and then a deeper dive into Weights &amp; Biases and its next act and Snowflake&#8217;s latest move into LLMs.</em></p><div><hr></div><h2>OpenAI&#8217;s price advantage is eroding even further</h2><p>Last week, Meta did the Meta thing where it released a family of new Llama models, which are smaller open-ish source models that are trained on <em>a ton of data for a</em> <em>really long time</em>. (The largest of the family isn&#8217;t done training.)</p><p>OpenAI already had to contend with a series of products that were highly competitive with GPT-3.5 Turbo, its workhorse model designed for <em>most</em> tasks. Its advantage was that it was comically easy to implement, even as other providers <a href="https://www.supervised.news/p/the-point-of-endpoints">started to release drop-in replacement endpoints over the past year that only require a few lines of code</a>.</p><p>As expected, the Llama 3 models that came out are <em>good</em>. There are more than enough technical teardowns out there and I&#8217;ll leave it up to the experts for that analysis. But its middle-of-the-road version, Llama 3 70B, is reaching the point where giving up that kind of consolidated enterprise experience&#8212;where you get a bunch of tools beyond just text completion in a single product&#8212;is actually starting to make sense because <em>it&#8217;s really cheap</em>.</p><p>That combination of <em>pretty</em> <em>cheap</em> and <em>all-in-one</em> has long been OpenAI&#8217;s advantage. You are only paying one company for a lot of stuff. But Llama 3 70B <a href="https://cdn.arstechnica.net/wp-content/uploads/2024/04/llama3_benchmarks.png">pretty much smushes GPT-3.5 Turbo</a> when it comes to most benchmarks (yes, we&#8217;re doing this again), while still being cheaper. 
</p><p>OpenAI was already staring down <a href="https://www.supervised.news/p/a-new-rival-startup-is-emerging-for">a lot of new &#8220;rivals&#8221;</a> that were basically riding the open source wave to offer really competitive products to GPT-3.5 Turbo thanks to releases from Mistral. Together AI, Fireworks, and others all rolled out endpoints for Mixtral which offered GPT-3.5 Turbo level scores for a fraction of the price.</p><p>But! That still comes with a pretty steep operational tradeoff. Simply chasing endpoints and optimizing for cost runs the risk of having to juggle way too many vendors and creating operational headaches even if it&#8217;s shaving off a substantial portion of usage costs&#8212;especially if teams are just running wild with corporate cards. This is especially true for companies that are shipping data to these API providers that are a little less proven (and less-funded) than OpenAI.</p><p><a href="https://www.supervised.news/p/batch-processing-and-the-rise-of">The majority of AI in production is still batch processing, which focuses on the use of smaller models</a>. While there&#8217;s a class of companies that rely on results that are at a GPT-4 Turbo level of quality (particularly for copilot products), <em>most</em> companies aren&#8217;t there and probably won&#8217;t be there for a while. Those smaller models are usually good enough (particularly if there&#8217;s some light customization on proprietary data).</p><p>OpenAI now has to contend with whether Llama 3 creates both an operational and a <em>behavioral</em> shift. Something that&#8217;s within shooting distance of GPT-4 is now widely available, and it&#8217;s at a comically low price compared to GPT-4 Turbo, from a lot of providers that have already established themselves as reliable. Setting aside scores and all that, this is a really tantalizing proposition for companies that have held out on deploying GPT-4 level products due to concerns around cost, latency, security, and reliability. 
And we haven&#8217;t even gotten to customization of Llama 3.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yBE2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7babac3-8414-4872-8b95-366cc46132e8_1960x3040.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yBE2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7babac3-8414-4872-8b95-366cc46132e8_1960x3040.png 424w, https://substackcdn.com/image/fetch/$s_!yBE2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7babac3-8414-4872-8b95-366cc46132e8_1960x3040.png 848w, https://substackcdn.com/image/fetch/$s_!yBE2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7babac3-8414-4872-8b95-366cc46132e8_1960x3040.png 1272w, https://substackcdn.com/image/fetch/$s_!yBE2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7babac3-8414-4872-8b95-366cc46132e8_1960x3040.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yBE2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7babac3-8414-4872-8b95-366cc46132e8_1960x3040.png" width="1456" height="2258" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7babac3-8414-4872-8b95-366cc46132e8_1960x3040.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2258,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187198,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yBE2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7babac3-8414-4872-8b95-366cc46132e8_1960x3040.png 424w, https://substackcdn.com/image/fetch/$s_!yBE2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7babac3-8414-4872-8b95-366cc46132e8_1960x3040.png 848w, https://substackcdn.com/image/fetch/$s_!yBE2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7babac3-8414-4872-8b95-366cc46132e8_1960x3040.png 1272w, https://substackcdn.com/image/fetch/$s_!yBE2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7babac3-8414-4872-8b95-366cc46132e8_1960x3040.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In addition to significantly cheaper and <em>almost good enough</em> endpoints available, most companies at this point understand that hosted versions of all of these&#8212;such as those through Together AI&#8212;push that cost down even further. The barrier to getting dedicated infrastructure is also dropping, <a href="https://www.supervised.news/p/batch-processing-and-the-rise-of">which is </a><em><a href="https://www.supervised.news/p/batch-processing-and-the-rise-of">also</a></em><a href="https://www.supervised.news/p/batch-processing-and-the-rise-of"> manifesting in a proliferation of pre-training</a>.</p><p>It&#8217;ll take a while before we start to see how Llama 3&#8217;s largest model performs in the wild relative to the degree that Meta has handcuffed it. But these benchmarks are the only things that currently present a buyer an apples-to-apples comparison, and they are really favorable to Meta. 
(The reality is these comparisons are obviously much more complicated, which we&#8217;ll get to in a second.)</p><p>OpenAI still offers what&#8217;s probably the easiest-to-use fine-tuning API out there with GPT-3.5 Turbo, which when you talk to developers is one of its more underrated products. But all these endpoint companies are incentivized to push customers to dedicated instances, and they&#8217;re going to finesse that pathway to customizing models like Llama 3 as best they can.</p><p>Inevitably this all benefits Meta. They get new tools to build into their own products and don&#8217;t have to pay another provider. Their intricate institutional knowledge allows them to squeeze even more power out of something like Llama 3 in the same way they do with PyTorch. And they build up an insane amount of goodwill amongst developers&#8212;which inevitably comes at the expense of more closed model developers.</p><p>Companies now have a really clear and <em>cheap</em> pathway to more advanced use cases for language models. What that looks like is still unclear (absent an explosion in new copilots), but Meta may once again be forcing everyone&#8217;s hand in a more unexpected way.</p><div><hr></div><h2>Weights &amp; Biases makes a bet on software developers</h2><p>Before the launch of ChatGPT, one of the hottest areas in machine learning investing was a category we (arguably haphazardly) referred to as machine learning ops&#8212;or MLops.</p><p>That category comprised a lot of different startups and emerging technologies, such as feature stores with Tecton or frameworks like Anyscale&#8217;s Ray. 
But perhaps the biggest darling among investors and developers of the MLops category is <a href="https://wandb.ai/site">Weights &amp; Biases</a>, which manages one of the most critical workflow components of machine learning: experiment tracking.</p><p>But Weights &amp; Biases has also turned into a test case for the single biggest question for startups in the MLops stack: can these startups building tools for machine learning create an &#8220;act two&#8221; for a new generation of technologies powered by language models?</p><p>Weights &amp; Biases is trying to answer that <a href="https://wandb.ai/site/weave">with the launch of Weave</a>, an updated suite of tooling for evaluating and debugging language models in production. And perhaps more importantly, Weave is targeted at the software engineers that are tasked with putting language models into production in some manner&#8212;a departure from its traditional focus on machine learning engineers and classic machine learning.</p><p>&#8220;We don&#8217;t think about machine learning engineers as a subset of software developers, we think about them as something different, and that&#8217;s really helped us,&#8221; Weights &amp; Biases CEO Lukas Biewald told me. &#8220;Now there&#8217;s a kind of software developer that&#8217;s developing generative AI, and they&#8217;re actually having to learn this experimental workflow that&#8217;s brand new to them. It&#8217;s actually not obvious to some software developers that they should even do that.&#8221;</p><p>Weave represents a lot of things, but if we wanted to oversimplify it here, it&#8217;s effectively a framework to debug LLM apps and (more importantly) try to formalize vibes-based evaluations. 
While developers fight to one-up each other on leaderboards with <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard">benchmarking scores</a> or <a href="https://chat.lmsys.org/?leaderboard">Elo ratings</a>, the reality is the &#8220;success&#8221; of an LLM internally at companies is often determined by eyeballing the results. They look at loosely-defined evaluation scores (sometimes even based off regex), or customer feedback, or even the results themselves, and are like, yeah, looks good to me.</p><p>Language models have inspired a whole new cohort of developers, and many of them are hacker-types that just want to string together a bunch of APIs and tools into something fun. Most language models in production today aren&#8217;t necessarily concerned with complicated processes like model lifecycle management. Instead, you can just drop an API into some frontend skin and, voila, you have an AI-powered app. </p><p>Most in the field I talk to agree that Weights &amp; Biases pretty much owns the experiment tracking market, even with projects like <a href="https://mlflow.org/">MLFlow</a> and alternatives like <a href="https://www.comet.com/site/">CometML</a>. Early last year, <a href="https://www.supervised.news/p/the-second-act-of-the-pre-gpt-ai">Weights &amp; Biases was in conversations for a funding round</a> that would value the startup at $2 billion, when it had between $20 million and $23 million in annual recurring revenue. (The round, which was opportunistic, didn&#8217;t materialize&#8212;though W&amp;B would raise another strategic round later in the year.)</p><p>And Weights &amp; Biases continues to quietly grow in the category it effectively helped create, even if it isn&#8217;t getting the kind of hype that generative AI application and infrastructure startups are getting. 
But there are many, many, many more developers than there are machine learning engineers&#8212;and they&#8217;re often the ones in the room when a CEO walks in and says &#8220;put AI in production&#8221; with absolutely no additional guidance or context.</p><p>Weights &amp; Biases is also quickly becoming a test case of whether a highly-successful MLops company (alongside <a href="https://www.supervised.news/p/what-happens-to-fivetran">others in the modern data stack</a>) can grow into a market built around a once-in-a-generation technology. And it&#8217;s now moving aggressively to show that it has a large place in workflows built on modern AI tooling.</p><h2>Powering developer workflows in AI</h2><p>Weights &amp; Biases held its first user conference in June last year and it captured the vibe of AI almost perfectly at that time&#8212;incredibly dense, chaotic, a little disorganized (with panels and talks going well into the evening), and overflowing with excitement and energy from attendees despite its practically remote location in the Dogpatch in San Francisco.</p><p>At the time, there were glimpses of what Weights &amp; Biases planned for generative AI with the launch of <a href="https://wandb.ai/site/solutions/llms">tools for aiding prompt engineering</a>. Its tools were perhaps among the best suited to adapt to generative AI use cases as prompt hacking emerged as an important part of building out workflows on top of language models. (One developer often describes this to me as &#8220;begging the model to do something.&#8221;)</p><p>This year&#8217;s conference was at the Marriott Marquis in San Francisco&#8217;s downtown, practically next door to another conference in the venue. 
Both more subdued and considerably more formal, it <em>also</em> represented the kind of <em>vibe shift</em> in AI from a limitless technology to something that enterprises were done experimenting with, and were looking to actually use to drive value for their organizations.</p><p>As machine learning and deep learning took hold in larger companies, Weights &amp; Biases quickly became a go-to platform for managing the lifecycle of machine learning models. That&#8217;s defined by experimentation, testing, iteration, and then (rarely) deployment&#8212;and starting the cycle over again. And <a href="https://medium.com/@l2k/starting-a-second-machine-learning-tools-company-ten-years-later-21a40324d091">one of its earliest </a>customers was a small startup at the time that would go on to become arguably the most important company in modern AI: OpenAI.</p><p>Weights &amp; Biases still carries a lot of weight when it comes to language models too, if only because of its ongoing deep connections with OpenAI. Earlier this month, OpenAI unveiled a slew of updates to its GPT-3.5 Turbo fine-tuning API. One of the biggest updates was third-party platform integration which, no surprise, <a href="https://openai.com/blog/introducing-improvements-to-the-fine-tuning-api-and-expanding-our-custom-models-program">included Weights &amp; Biases as its first integration</a>. Fine-tuning, while it offers a lot of potential benefits, is still considered pretty advanced for most companies&#8212;and would by then have likely roped in the MLE team.</p>
      <p>
          <a href="https://www.supervised.news/p/vibes-based-evals-and-weights-and">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[OpenAI (sort of) covers one of its blind spots]]></title><description><![CDATA[OpenAI now has a batch processing API. But this time around, it&#8217;s dealing with more than just a handful of startups&#8212;including Snowflake and Databricks.]]></description><link>https://www.supervised.news/p/openai-sort-of-covers-one-of-its</link><guid isPermaLink="false">https://www.supervised.news/p/openai-sort-of-covers-one-of-its</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Tue, 16 Apr 2024 13:01:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tWDu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tWDu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tWDu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!tWDu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!tWDu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!tWDu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tWDu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:545960,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tWDu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!tWDu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!tWDu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!tWDu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ce75377-eb84-4d3d-8a22-f11d93d7ccbd_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">a small friendly robot carrying a very large box full of laptops, monitors, and green notebooks, the box is overflowing, the robot is rushing out of an office, the robot is sweating and smiling awkwardly, sunday funny comics aesthetic &#8212; midjourney</figcaption></figure></div><p><em>Author&#8217;s note: Due to the busy nature of the 
week with multiple events and the launch of Llama 3, Friday&#8217;s column will be moved to early next week and be a double-issue. </em></p><div><hr></div><p>OpenAI is adding another one of the missing pieces in its toolkit that other providers were quickly running off with&#8212;and this time it handles one of the most important pathways to enterprises actually putting language models in production.</p><p>Last week I&#8217;d noted that <a href="https://www.supervised.news/p/batch-processing-and-the-rise-of">most companies that I&#8217;ve been talking to lately have been using modern language models as part of a batch data processing workflow</a>. OpenAI this week announced the <a href="https://twitter.com/OpenAIDevs/status/1779922566091522492">release of its batch processing API</a>, which enables users to process large amounts of data on a less urgent timeline&#8212;but in exchange you get higher rate limits and <a href="https://openai.com/pricing">half off the standard API price</a>. (And it looks like OpenAI has finally adopted the per-million token pricing structure instead of per-thousand.)</p><p>Companies process large amounts of unstructured data, such as customer support tickets, and execute simple tasks such as classification, sentiment analysis, or summarization, on a longer timeline&#8212;typically using smaller models like Mistral 7B or Llama 13B on timescales that are hours (or even days). It&#8217;s not so dissimilar to what companies were already doing with BERT, an early open source language model, and it&#8217;s now more efficient and effective.</p><p>OpenAI, the first mover in making language models broadly available through its GPT-series APIs, had effectively sat on the sidelines while companies like Snowflake and Databricks&#8212;hosting these smaller open source models adjacent to company data&#8212;built up what increasingly seems to be the standard use of language models in enterprises.
While you could fire up a smaller Mistral model on top of your data, GPT-3.5 Turbo was still $0.50 per million input tokens and $1.50 per million output after its latest price cut.</p><p>In fact, when you talk to partners, platforms, or the enterprises themselves, batch data processing is the clearest case of &#8220;AI in prod&#8221; you&#8217;re going to get right now. The whole dream of some mythical autonomous agent replacing entire teams has largely faded into the background, and in its place is a bunch of pretty generic and boring use cases that actually return a disproportionate amount of value.</p><p>As is the case with many of the modalities and business models where OpenAI faces increasing competition from startups (and, somewhat less obviously, platform providers), it&#8217;s adding the API to its repertoire in a kind of better-late-than-never approach. But it looks like its own &#8220;flavor&#8221; here is that it&#8217;s offering cheaper versions of its portfolio of models, rather than just one or two smaller ones.</p><p>There are plenty of reasons why OpenAI would want to do this, not least to keep up with potential customers turning to others for cheaper batch data processing. But it also offers <a href="https://www.supervised.news/p/the-point-of-endpoints">another interesting opportunity to address another emerging challenge for companies that manage hordes of GPUs</a>: getting as close to maximum utilization as possible.</p><p>This time around, though, OpenAI isn&#8217;t just dealing with a handful of startups trying to eat away at its cost advantage on the edges.
It has to contend with the challenge that enterprises handling batch data processing are the same ones that probably have their data already stored somewhere that offers it in some fashion.</p><h2>The appeal of batch data processing</h2><p>When a request that&#8217;s uniquely suited to a language model isn&#8217;t particularly urgent&#8212;unlike an angry customer on the line&#8212;it turns out you have a lot more language model options at your disposal. You could instead comb through <em>all</em> of those angry customer calls, using a less-powerful model to extract basic insights from them to understand why your product might be broken in the first place. A model with GPT-4 level quality isn&#8217;t really necessary, nor would it be particularly practical from a cost perspective.</p><p>A lot of these problems end up getting routed through support centers because they have a kind of "eye test" feel to them, where you couldn't just directly extract the result with a standard machine learning model. Companies like&nbsp;<a href="https://www.linkedin.com/company/snowflake-computing/">Snowflake</a>&nbsp;and&nbsp;<a href="https://www.linkedin.com/company/databricks/">Databricks</a>&nbsp;have been implementing access to open source models like Mistral's, providing those kinds of tools right on top of a company's pool of data.</p><p>Realistically, a company could upgrade to a more powerful model for batch processing at some multiple of the price of a smaller model like one from Mistral or Meta (or any of their openly available fine-tuned offshoots). OpenAI&#8217;s batch pricing brings it more in line with a company looking at using a Mixtral-type model.
And, again, the benefit OpenAI has, broadly speaking, is that it&#8217;s just incredibly easy to use.</p><p>So we&#8217;re going to look at two different price breakdowns&#8212;one that feels a smidge closer to reality for enterprises with batch processing already in place that are looking at using smaller (especially customized) models, and one that probably better reflects the reality of people who might use the OpenAI batch API. </p><p>As a usual caveat here, <em>most of the value of these smaller models is unlocked through customization</em>. So a raw endpoint-to-endpoint pricing comparison is a little fuzzy, as fine-tuned versions are hosted and have different pricing principles (such as GPU-per-hour pricing). OpenAI <em>does</em> offer a fine-tuned GPT-3.5 Turbo, but it&#8217;s on the order of $3 per million input and $6 per million output.</p><p>But we&#8217;re working with what we have, and the smaller models are already somewhat capable of doing some of these hyper-specific workflow tasks (like less-complicated summarization) before fine-tuning.
Let&#8217;s start with the comparisons with some endpoints of Mistral 7B on other providers <a href="https://openai.com/pricing">to OpenAI&#8217;s GPT-3.5 Turbo model at a 50% discount</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YQ3y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54476cd6-e1e8-41de-8609-69e3cfd9092e_888x544.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YQ3y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54476cd6-e1e8-41de-8609-69e3cfd9092e_888x544.png 424w, https://substackcdn.com/image/fetch/$s_!YQ3y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54476cd6-e1e8-41de-8609-69e3cfd9092e_888x544.png 848w, https://substackcdn.com/image/fetch/$s_!YQ3y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54476cd6-e1e8-41de-8609-69e3cfd9092e_888x544.png 1272w, https://substackcdn.com/image/fetch/$s_!YQ3y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54476cd6-e1e8-41de-8609-69e3cfd9092e_888x544.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YQ3y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54476cd6-e1e8-41de-8609-69e3cfd9092e_888x544.png" width="888" height="544" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54476cd6-e1e8-41de-8609-69e3cfd9092e_888x544.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:544,&quot;width&quot;:888,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32873,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YQ3y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54476cd6-e1e8-41de-8609-69e3cfd9092e_888x544.png 424w, https://substackcdn.com/image/fetch/$s_!YQ3y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54476cd6-e1e8-41de-8609-69e3cfd9092e_888x544.png 848w, https://substackcdn.com/image/fetch/$s_!YQ3y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54476cd6-e1e8-41de-8609-69e3cfd9092e_888x544.png 1272w, https://substackcdn.com/image/fetch/$s_!YQ3y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54476cd6-e1e8-41de-8609-69e3cfd9092e_888x544.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As you&#8217;d expect, the pricing here doesn&#8217;t exactly lend itself well to OpenAI. There&#8217;s certainly an upper bound to the performance of these models and some additional complexity that comes with hosting a fine-tuned version, which would bring the performance for specific tasks more in line with what a GPT-series model can do from OpenAI.</p><p>OpenAI, however, is probably looking for a bit of a more favorable comparison here&#8212;with GPT-3.5 Turbo&#8217;s quality more in line with that of Mixtral or Claude Haiku. If we were to look at a comparison amongst those models&#8230;</p>
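<p><em>For a rough sense of what the 50% batch discount means in practice, here&#8217;s a back-of-the-envelope sketch using the GPT-3.5 Turbo prices cited above; the per-ticket token counts are illustrative assumptions, not OpenAI figures:</em></p>

```python
# Back-of-the-envelope cost of a batch job at the prices cited above:
# GPT-3.5 Turbo at $0.50 per million input tokens and $1.50 per million
# output tokens, with the batch API taking 50% off both.

PRICE_INPUT_PER_M = 0.50   # USD per million input tokens (standard rate)
PRICE_OUTPUT_PER_M = 1.50  # USD per million output tokens (standard rate)
BATCH_DISCOUNT = 0.5       # batch API: half off the standard price

def batch_cost(n_requests, input_tokens_each, output_tokens_each):
    """Estimated USD cost of running n_requests through the batch API."""
    input_cost = n_requests * input_tokens_each / 1e6 * PRICE_INPUT_PER_M
    output_cost = n_requests * output_tokens_each / 1e6 * PRICE_OUTPUT_PER_M
    return (input_cost + output_cost) * BATCH_DISCOUNT

# One million support tickets at ~500 input and ~20 output tokens each
# (illustrative sizes) comes out to $140.00.
print(f"${batch_cost(1_000_000, 500, 20):.2f}")
```

<p><em>At those assumed sizes, a million tickets lands around $140, which is the scale of line item that makes batch classification an easy sell&#8212;though the Mistral-class endpoints in the chart above still undercut it.</em></p>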
      <p>
          <a href="https://www.supervised.news/p/openai-sort-of-covers-one-of-its">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Batch processing and the rise of CPUs]]></title><description><![CDATA[Plus: the death of pre-training seems greatly exaggerated.]]></description><link>https://www.supervised.news/p/batch-processing-and-the-rise-of</link><guid isPermaLink="false">https://www.supervised.news/p/batch-processing-and-the-rise-of</guid><dc:creator><![CDATA[Matthew Lynley]]></dc:creator><pubDate>Fri, 12 Apr 2024 21:12:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!n17y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n17y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n17y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png 424w, https://substackcdn.com/image/fetch/$s_!n17y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png 848w, https://substackcdn.com/image/fetch/$s_!n17y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png 1272w, 
https://substackcdn.com/image/fetch/$s_!n17y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n17y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png" width="1232" height="928" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:928,&quot;width&quot;:1232,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1330395,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n17y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png 424w, https://substackcdn.com/image/fetch/$s_!n17y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png 848w, https://substackcdn.com/image/fetch/$s_!n17y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png 1272w, 
https://substackcdn.com/image/fetch/$s_!n17y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a2cff6-4988-4103-bdb4-a6ece5f39028_1232x928.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">a small sleek friendly pixar-styled robot standing underneath a towering GPU, the GPU is ten times the size of the robot, the robot is holding a power cable and a large book, it's daylight in the middle of a field like the windows XP background, sunday funny comics aesthetic, --ar 4:3 &#8212; 
midjourney</figcaption></figure></div><p><em>Both issues for the week are coming together in this one! First we&#8217;ll talk about how enterprises are actually using language models in prod, and then how pre-training still seems alive and well.</em></p><div><hr></div><p>When ChatGPT launched, CEOs at pretty much every company you talk to gave a top-down mandate to implement AI in their products.</p><p>And to be clear, at the time, <em>no one</em> knew what that looked like. But the extreme hype around language models&#8212;along with the fact that they, in a lot of ways, worked out of the box&#8212;made them impossible to ignore. The fever dreams of replacing entire work functions were pervasive. And the common refrain was along the lines of, &#8220;if not me, then one of my competitors will do it.&#8221;</p><p><a href="https://www.supervised.news/p/ai-in-august-dwindling-hype-unexpected">That hype cycle </a>has ended, thankfully, and language models specifically have started to find a home within larger companies. But the reality among enterprises and platform providers I talk to is that the use cases aren&#8217;t as epically transformative as we might have originally predicted. </p><p>Actually, they&#8217;re (at face value) pretty straightforward. Instead of creating complex reasoning tasks and full-on automation that replaces entire teams, many enterprises I talk to are either using language models, or plan to use language models, for batch data processing.</p><p>That can include summarization, classification, entity extraction, or even cleaning up data. And it shouldn&#8217;t exactly come as a shock&#8212;<a href="https://www.tensorflow.org/text/tutorials/classify_text_with_bert">that&#8217;s what companies were doing with BERT</a>, <a href="https://research.google/blog/open-sourcing-bert-state-of-the-art-pre-training-for-natural-language-processing/">one of the earliest open source language models</a>, in a lot of use cases.
The emergence of the Llama-series models and Mistral&#8217;s models has just jacked up the quality of those results and reduced the barrier to entry to the point that they can be more efficient and (a little more) trustworthy.</p><p>&#8220;All of our customers have this dream of the full-on automation, but the steps to get there feel a little more incremental than a massive shift than all of our chats and emails and SMS handled by something that&#8217;s perhaps a black box,&#8221; Ben Gleitzman, CTO and co-founder of Replicant, a call center automation tool, told me. Replicant is backed by Salesforce Ventures, Norwest, Atomic, and others.</p><p>The &#8220;batch&#8221; part here refers to analyzing large amounts of data on a non-urgent timeline. The tasks here are usually relatively simple and straightforward and not in the kind of &#8220;write an email for me in the style of X&#8221; fashion that requires the quality level of, say, a GPT- or Claude-series model. Instead, you&#8217;re more or less on a fact-finding mission within troves of proprietary company data.</p><p>While you could do it through one of OpenAI&#8217;s APIs, open source models like Mistral&#8217;s smaller 7B model are actually well-suited to the problem&#8212;particularly for enterprises&#8212;because they&#8217;re efficient, cheap, and in many cases are already plugged into data abstraction layers. Databricks and Snowflake <a href="https://www.supervised.news/p/the-point-of-endpoints">have both announced access to Mistral&#8217;s open source models through endpoints</a>, and it&#8217;s an easy jump to a fine-tune because the data&#8217;s already there.</p><p>&#8220;We see a lot of batch processing&nbsp;on Snowflake, from summarization and the sentiment analysis of support tickets, to batch data extraction from SEC filings or the competitive analysis of lost deals,&#8221; Baris Gultekin, Head of AI at Snowflake, told me.
&#8220;Large language models increase the productivity of analysts substantially, enabling them to quickly extract insights from text by using something as simple as the English language, all in a cost-effective way across millions of rows of data.&#8221;</p><p>More importantly, these smaller models don&#8217;t necessarily need the extreme power of a cluster of H100 GPUs. It&#8217;s even to the point that Apple&#8217;s <a href="https://twitter.com/awnihannun">Awni Hannun</a>, who is behind <a href="https://github.com/ml-explore/mlx">Apple&#8217;s excellent MLX Mac development and inference framework</a>, and <a href="https://github.com/ggerganov/llama.cpp">Llama.cpp</a> creator <a href="https://twitter.com/ggerganov?lang=en">Georgi Gerganov</a> <a href="https://twitter.com/ggerganov/status/1775921043858764061">are competing</a> over who can get the highest throughput for Mistral models on a Mac. </p><p>As these models continue to improve, and can run on low-power hardware, they&#8217;re becoming a very attractive option for companies looking to remove a lot of &#8220;eye tests&#8221; in workflows that might be offloaded to support centers. And while we wait for a &#8220;killer&#8221; use case for language models (if one ever emerges), it turns out that some enterprises have already found a reason to implement it&#8212;even if it&#8217;s a bit of a snoozer.</p><h2>Where the advantage of smaller models comes in</h2><p>Anyone who&#8217;s worked in sales, research, marketing, or customer service has done their time sitting in on dozens of calls listening to what customers or prospects are saying and asking for, independent of what the data shows. Jeff Bezos gave a particularly good summary of why<a href="https://youtu.be/DcWqzZ3I2cY?feature=shared&amp;t=5628"> in an extensive interview with Lex Fridman</a>:</p><blockquote><p>When the data and anecdotes disagree, the anecdotes are usually right.
It doesn&#8217;t mean you just slavishly follow the anecdotes, it means you go examine the data. It&#8217;s usually not that the data is being mis-collected; it&#8217;s usually that you&#8217;re not measuring the right thing.</p></blockquote><p>Now, there are a lot of ways to slice that statement. But <em>one</em> part of it is that understanding what customers need or are looking for comes down to a human with empathy and experience eyeballing something, and getting some insight that can&#8217;t be mechanically extracted with traditional machine learning. Or, at least, something you can&#8217;t extract at a level that&#8217;s fully trustworthy and error-proof.</p><p>While it&#8217;s still early, modern language models offer a potential chance to capture some of that so-called &#8220;anecdata&#8221; by trying to understand semantically what information is coming in. Many companies that would benefit from using it for batch processing have spent <em>years</em> collecting information from, for example, customer support resolutions. And the results are usually pretty binary: did the customer resolve their issue and have a good experience, or not.</p><p>From there, you could extract flags that suggest what direction to send a ticket, or what the largest pain points customers have had recently. We&#8217;ve been trying for years to automate this with machine learning, but the advent of language models, even smaller ones like Mistral 7B, makes the process much more approachable due to their flexibility and ease of implementation. And those language models are able to deliver real results in cases where you would want to eyeball data.</p>
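<p><em>As a hypothetical sketch of what that ticket-flagging workflow can look like, here&#8217;s how you might turn a pile of support-ticket resolutions into one classification request per line for a small model. The label set, prompt wording, model name, and request shape are all illustrative assumptions, not any vendor&#8217;s spec:</em></p>

```python
import json

# Hypothetical sketch: batch-classifying support-ticket outcomes with a
# small instruction-tuned model. Labels and prompt wording are
# illustrative assumptions.

LABELS = ["resolved_happy", "resolved_unhappy", "unresolved"]

def build_prompt(ticket_text):
    # Small models tend to do best with a narrow, explicit label set and
    # one self-contained instruction per ticket.
    return (
        "Classify the outcome of this support ticket as exactly one of: "
        + ", ".join(LABELS)
        + f"\n\nTicket:\n{ticket_text}\n\nLabel:"
    )

def to_jsonl(tickets, model="small-instruct-model"):
    # One JSON object per line, tagged with an id so results can be
    # matched back to the original ticket after the batch completes.
    lines = []
    for i, text in enumerate(tickets):
        lines.append(json.dumps({
            "custom_id": f"ticket-{i}",
            "model": model,
            "messages": [{"role": "user", "content": build_prompt(text)}],
        }))
    return "\n".join(lines)

print(to_jsonl(["Refund issued; customer thanked the agent."]))
```

<p><em>From there, tallying the returned labels gives you exactly the binary resolved/unresolved picture described above, across millions of rows rather than a sampled eyeball pass.</em></p>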
      <p>
          <a href="https://www.supervised.news/p/batch-processing-and-the-rise-of">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>