Meta's Llamas go local
Plus: Notion, one of AI’s brightest stars, rolls out a more fully fledged AI companion, and several additional executives leave OpenAI.
Existing paid subscribers received a one-week extension to account for a missed issue last week while I’m working through another lower back injury. Thank you, everyone, for your patience!
Meta’s model paths come together
Meta’s Llama portfolio continues to grow in size, complexity, and nomenclature—but this time, it’s going after what’s quickly becoming one of the most important potential markets in AI: networks of task-specific smaller models.
While Meta has released larger models in a way that sows a little chaos at the top end of the quality spectrum, where OpenAI and Anthropic sit, it now seems to be setting its sights on sowing more chaos at the lower end of the spectrum, where Google’s Gemma 2 and Microsoft’s Phi models have become popular. Meta’s new tiny models, which it says in a blog post are designed for edge use cases, fit neatly into one of roughly four “buckets.” And Meta is now competing with many of its “rivals” in all four tiers.
“Tiny” ones that are suited for trivial and straightforward tasks with very little compute: Gemma 2, Phi-3, Llama 3.2 1B and 3B, and various other open source models. These models are well-suited for cases where you have little memory or compute to work with, such as the edge devices Meta suggests.
Smaller, more general purpose “workhorse” models designed to handle mostly trivial tasks that are still too complex for the “tiny” models: Gemma 2 9B, Llama 3.1 8B, Llama 3.2 11B, Mistral NeMo/Pixtral, and various other open source models. In these cases you’re less memory-constrained, but cost starts to become a consideration, making these smaller open-ish source models more attractive than APIs like GPT-4o-mini. Such tasks could involve summarizing a high volume of inbound documents, like sales calls or customer support tickets.
The not-quite-as-good-as-GPT-4-but-still-pretty-good category, where tasks start to become a little more ambiguous and open-ended: Meta’s 70B (and now 90B) models fit in here, but you could also throw in Cohere’s RAG-focused Command-R+ and the larger versions of the Gemma-series models. You could imagine more complex searches, like across legal documents, that would be well-suited to GPT-4o, where you’re willing to take a small penalty on performance and ease of use (such as some increased complexity for fine-tuning) in exchange for lower costs.
The substantially larger and more general purpose versions for tasks requiring high-quality responses: Llama 3.1 405B and the available foundation model APIs like Anthropic’s Sonnet, Gemini Pro, and GPT-4o. These are better suited to extremely high quality conversational experiences, like the ones you might find in Notion (more on that in a second).
OpenAI is also seemingly trying to create a new category in those buckets with its slower-but-more-powerful o1 series models. In that case, OpenAI is once again flexing its research capabilities, even if that means trying to build out into a new niche.
But an enormous amount of the emphasis in this announcement seems to land on that “edge device” part, with Meta listing a number of different pieces of hardware the models will work on. While that might end up providing some useful experiences on devices, the results from Apple’s (beta) approach, integrated into one of the most prevalent operating systems on the planet, have been pretty mixed. The upper bound of these tiny models on edge devices, even augmented with a set of task-specific adapters, isn’t so clear.
But Meta will also be able to capitalize on the enormous enthusiasm around local, on-device AI inference, which has shown a lot of promise for rapid prototyping and development. Meta here gets to once again pounce on a developer community that’s craving more options to play with, and then just grab the best ideas for its own edge devices, like its prototype Orion headset. And that starts with finding the absolute bare-bones utility of those smaller models.
Extending RPA to the edge
While these smaller models are designed to work well on edge devices, with some customization on task-specific or company-specific data they excel at simple jobs like classification, entity extraction, or summarization. The emerging base case for AI has basically been to construct a more advanced version of robotic process automation (or RPA) by chaining together smaller models that each independently resolve these more trivial tasks.
That RPA-oriented “base case” has emerged as a pathway to quickly recognize an actual return on investment in AI tooling at a time when we’re all talking about how we’re in an enormous AI bubble. And a lot of this was already powered by the smaller of Meta’s existing Llama 3.1 models, though among the companies I talk to there has generally been some interest in experimenting with the even smaller ones (particularly the “tiny” version of Phi-3).
Meta’s existing smaller model (8B) is getting an upgrade to being multimodal and slightly larger (11B), but it still has a small enough footprint that you could probably comfortably run a compact version of it on a MacBook Pro. A Hugging Face blog post indicates a compact version of the vision model takes up around 10GB of GPU RAM during inference. And unsurprisingly, Meta’s “tiny” models are already available on Ollama, a popular tool for running smaller models locally on devices like a MacBook Pro.
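To give a sense of how low the barrier is here, a minimal sketch of calling one of these models locally through Ollama’s JavaScript client (the “ollama” package on npm) might look something like the code below. The model tag, the image file, and the prompt are illustrative assumptions rather than anything Meta or Ollama prescribes, so check Ollama’s model library for the exact names.

```typescript
// Minimal sketch: describing an image with a locally hosted multimodal Llama
// via the Ollama JS client. Assumes Ollama is running and that a vision-capable
// model has already been pulled (the tag below is an assumption).
import { readFileSync } from 'node:fs'
import ollama from 'ollama'

async function describeImage(path: string): Promise<string> {
  const response = await ollama.chat({
    model: 'llama3.2-vision', // assumed tag for the 11B multimodal variant
    messages: [
      {
        role: 'user',
        content: 'Describe what this chart shows in one paragraph.',
        // Ollama's chat API accepts images as base64-encoded strings
        images: [readFileSync(path).toString('base64')],
      },
    ],
  })
  return response.message.content
}

describeImage('./quarterly-chart.png').then(console.log)
```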
It’s also why we’ve started to see a kind of obsession with extreme speed when generating results from smaller models, like output in the thousands of tokens per second. Rather than just looking for a single response, fast inference platforms enable companies to quickly chain together task-specific models into a kind of proto-agentic network that, at the end of the chain, accomplishes some more difficult task.
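As a rough illustration of that chaining pattern, and under the same assumptions as the sketch above (the Ollama JS client plus an assumed small-model tag), a toy pipeline over a support ticket might classify, extract, and then summarize in sequence. The prompts and the ticket text here are entirely hypothetical:

```typescript
// Toy RPA-style chain: each step is a trivial task handled by a small local
// model, and the output of earlier steps feeds the later ones.
import ollama from 'ollama'

const MODEL = 'llama3.2' // assumed tag for a small local model

async function step(prompt: string): Promise<string> {
  const response = await ollama.chat({
    model: MODEL,
    messages: [{ role: 'user', content: prompt }],
  })
  return response.message.content.trim()
}

async function processTicket(ticket: string) {
  // 1. Classification: a one-word label
  const category = await step(
    `Classify this support ticket as "billing", "bug", or "other". Reply with one word only.\n\n${ticket}`,
  )
  // 2. Entity extraction: pull out the names the later steps care about
  const entities = await step(
    `List the product names and customer identifiers mentioned below, one per line.\n\n${ticket}`,
  )
  // 3. Summarization: combine the earlier outputs into the final artifact
  const summary = await step(
    `Summarize this ${category} ticket in two sentences. Mention these entities:\n${entities}\n\nTicket:\n${ticket}`,
  )
  return { category, entities, summary }
}

processTicket('Our Acme Pro workspace was billed twice this month for seat upgrades...')
  .then(console.log)
```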
While these kinds of “buckets” have been in the making for the past year or so, Meta’s more direct entry into each of them with the newest iteration of its Llama 3 models has formalized the boundaries. Rather than constructing the categories based on model size (models of quite different sizes can perform similarly), the divisions are based on task complexity.
Many enterprises I’ve spoken with that are interested in actually putting AI-based tooling into production have shifted their thinking about how they plan to deploy AI. Expectations have dropped dramatically from a time when executives thought they could throw everything into an OpenAI API and replace entire departments outright, but those lowered expectations have also exposed how practical some of these smaller language models are for actually useful tasks. And hosting and deploying these smaller, more customized models is significantly more attractive now than it was a year ago.
Part of that is because of the significant cost of relying on APIs like OpenAI’s and Anthropic’s for those more powerful frontier models, even when it comes to their smaller counterparts like GPT-4o-mini. While the price of those API-backed models continues to go down (and is, to be clear, in a race to the bottom), it’s still considerably cheaper to host your own custom version of a smaller language model. And in some ways, those smaller models can even be a bit better because they’re more tailored to specific use cases.
Going even smaller than the “practical” small models exposes a lot of potential new use cases when it comes to edge devices, which could be phones but could also be sensors or an array of Raspberry Pis. But when we’re talking about “edge” use cases, one in particular has lately gotten a lot of interest from executives and sources, even if it’s still just a curiosity: in-browser language model experiences.
In this case, using WebGPU, the “edge” device is actually a browser. One example I’ve seen thrown around before is an in-browser SQL-generating assistant running in tandem with DuckDB, creating a whole new kind of local analytics experience. (There are some technical concerns that come up here around threads and available memory, but again, it’s an example.)
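For the curious, here is a rough sketch of the shape that idea could take, assuming a WebGPU runtime like the @mlc-ai/web-llm package (one option among several; the example above doesn’t necessarily use it). The model ID, schema, and prompts are illustrative, and the returned SQL string would then be handed to a DuckDB-WASM connection running in the same page:

```typescript
// Rough sketch: a small model running on WebGPU in the browser drafts DuckDB
// SQL from a natural-language question. The model ID is an assumption; check
// web-llm's prebuilt model list for current names.
import { CreateMLCEngine } from '@mlc-ai/web-llm'

async function askForSql(question: string, schema: string): Promise<string> {
  // Downloads quantized weights into the browser and runs inference on WebGPU.
  const engine = await CreateMLCEngine('Llama-3.2-1B-Instruct-q4f16_1-MLC')

  const reply = await engine.chat.completions.create({
    messages: [
      { role: 'system', content: `You write a single DuckDB SQL query. Schema:\n${schema}` },
      { role: 'user', content: question },
    ],
  })
  // The resulting SQL string would be executed against DuckDB-WASM in the page.
  return reply.choices[0].message.content ?? ''
}

askForSql('Total revenue by month in 2024?', 'orders(order_id, amount, created_at)')
  .then(console.log)
```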
Meta, in the meantime, is also benefiting from jumping on the very hyped and exciting (and still relatively impractical in production) focus on local model development. Llama.cpp, Ollama, and other tools enabling local model usage have sparked a whole wave of enthusiasm around rapid prototyping and development of potential AI-powered products at a larger scale.
(Meta, alas, did not give me a heads up on the news or invite me to any of these events. One day!)
Notion tries to do a bit of everything in AI
On a rather hectic news day with the new Llama-series models and the abrupt departure of OpenAI CTO Mira Murati, Notion also rolled out a significantly updated version of its integrated AI tools.
Notion is considered by most of the sources and industry experts I speak with to be one of the most advanced companies when it comes to actually putting AI tools into production. There’s a handful of companies that routinely come up, but most typically think Notion’s approach and architecture is one of the most complex and effective ways to utilize everything that’s available.
And in that way, a lot of investors, developers, and other startups look at Notion as a kind of “bellwether” in AI development, letting the startup do all the experimentation and product development to see what’s actually possible and what sticks with customers.
Its latest update includes connectors to Slack and Google Drive, introducing a new set of unstructured data pipelines, as well as other additions like analysis of file and image attachments, style-specific document generation and editing, and direct selection of knowledge sources.
“I’d say we are small enough and nimble enough that we don’t have to go through a bunch of layers of approval to try something new out,” Shir Yehoshua, AI engineering lead at Notion, told me. “We have a very prototype-y culture. Our general model is to prototype it first and if it works—whether that’s an evaluation tool, or an actual LLM provider, a vector DB—as fast as we possibly can and get very empirical data on how well it works. And also try not to fall into a sunk cost fallacy.”