Will AI finally take real-time data mainstream?
Streaming pipelines have always been costly and difficult to implement. But rapid experimentation in AI means the wallets are suddenly open to revisit them.
Author’s note: due to some logistics I’ll be observing the holidays this week with only one post. There will be two as scheduled for next week. Have an excellent break!
The joke about data streaming is that next year is always the year it will finally go mainstream.
As we head into 2024, though, some of the promises of real-time data pipelines are starting to show up in one of the most-hyped technologies of the last decade, and it might get a chance to ride the AI wave.
Specifically, companies I talk to these days are evaluating whether much faster data ingestion and access can improve the performance of models by making more recent data available quickly. As retrieval-augmented generation (or RAG) grows in popularity, the prospect of almost immediate access to more recent data is becoming more appealing to companies actually evaluating models in production.
“If you are then saying that RAG really is one of the biggest catalysts of generative AI, then I think you go back and say what is your ability to get to the source of data that you need—which essentially says, do I have the kind of data I need,” Raj Verma, CEO of SingleStore, said. “If I have access to it, how quickly can I marry that so that it provides me with immediate context and then marry that recency and some historical context.”
You could think of this as a kind of “live” RAG architecture, where information comes in from some kind of source—like a user action or a sensor—and is almost immediately embedded into a format that a language model can retrieve through RAG. And embedding models are quickly evolving into a much more sophisticated subset of AI, with multiple APIs now available and increasing interest in customizing those embedding models themselves through fine-tuning.
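To make that concrete, here is a minimal sketch of what a live ingestion path might look like, assuming the sentence-transformers library for embeddings (the model choice is arbitrary) and a simple in-memory list standing in for a vector database. In practice the events would arrive from something like a Kafka consumer and land in a managed vector index; the point is only that each event becomes retrievable the moment it is embedded, rather than waiting for the next batch run.

```python
# A minimal sketch of a "live" RAG ingestion path: an event arrives,
# is embedded immediately, and is retrievable for the very next query.
# Assumptions: sentence-transformers for embeddings (arbitrary model
# choice) and an in-memory list as a stand-in for a vector database.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
store = []  # list of (text, vector) pairs; stand-in for a vector DB


def ingest(event_text: str) -> None:
    """Embed an incoming event as soon as it arrives and index it."""
    vector = model.encode(event_text, normalize_embeddings=True)
    store.append((event_text, vector))


def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k most similar indexed events for a query."""
    q = model.encode(query, normalize_embeddings=True)
    scored = sorted(store, key=lambda item: float(np.dot(item[1], q)), reverse=True)
    return [text for text, _ in scored[:k]]


# Simulated stream: each event is retrievable right after ingestion.
ingest("Customer 412 upgraded to the enterprise plan at 09:14 UTC.")
ingest("Sensor 7 reported a temperature spike on line 3.")
print(retrieve("What did sensor 7 report?"))
```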
True streaming data pipelines have existed for some time, though they tend to serve more niche uses like fraud detection and recommendations. Most architectures are built on an open-source technology called Kafka. Confluent, a now-public company founded in 2014, is one of the largest providers of managed streaming services.
But each additional year in the development arc of streaming data tooling brings a kind of blurring between the definitions of “streaming” and “real-time”—which, while similar, have some differences under the hood. And the biggest challenge is that they’re often used interchangeably.
The latter is effectively a catch-all meaning that data arrives fast enough for a given purpose, not necessarily in a streaming fashion through a technology like Kafka. What ends up happening is that tooling makes data available in smaller and smaller batches that process much more quickly, a process referred to as micro-batching. While orchestration tools have grown increasingly sophisticated at managing that kind of micro-batching, big-batch processing remains widespread.
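For illustration, here is a rough sketch of the micro-batching pattern, with hypothetical names and thresholds rather than any particular tool’s API: records accumulate in a buffer and get flushed when either a size limit or a time window is hit, which is much faster than a nightly job but still not event-by-event streaming.

```python
# A minimal sketch of micro-batching: records are buffered and flushed
# when the batch hits a size limit or a time window elapses, whichever
# comes first. Names and thresholds are illustrative only.
import time


class MicroBatcher:
    def __init__(self, max_size: int = 100, max_wait_seconds: float = 5.0):
        self.max_size = max_size
        self.max_wait_seconds = max_wait_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, record: dict) -> None:
        """Buffer a record and flush if either threshold is crossed."""
        self.buffer.append(record)
        too_big = len(self.buffer) >= self.max_size
        too_old = time.monotonic() - self.last_flush >= self.max_wait_seconds
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Process the current batch, e.g. write it to a warehouse table."""
        if self.buffer:
            print(f"processing batch of {len(self.buffer)} records")
            self.buffer.clear()
        self.last_flush = time.monotonic()


batcher = MicroBatcher(max_size=3, max_wait_seconds=1.0)
for i in range(7):
    batcher.add({"event_id": i})
batcher.flush()  # drain whatever is left
```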
A company looking to offer a kind of AI assistant on top of their tooling, though, might suddenly find it much more prudent to make data available sooner than processing it just a few times—or even once—a day. You could imagine a customer uploading some kind of information and wanting to access it through an assistant right away, and waiting to process that in some accessible form would be a big hit to the user experience.
In either case, AI is increasingly offering a big tailwind to interest in faster access to high-quality data, with the promise of more recent information available for retrieval being one example. Companies are almost universally looking to have some kind of AI strategy to jump on the hype train, and the necessary budgets are typically along for the ride. And streaming has the potential to power AI use cases well beyond the classic niche ones.
It’s not so dissimilar from the explosive growth of the modern data stack starting in the late 2010s, though this time it’s at a much larger scale. Budgets quickly ballooned as companies started to rethink their whole data stacks, and venture funding flowed like water into the ecosystem, which has since sprawled into a complicated mess that most expect to consolidate.
But it means old problems (like streaming data) that showed promise but carried little urgency now have the attention of executives at the top of the company, where the directive to build out an AI strategy typically starts.
“Nowadays, user expectations are such that we’ve all come to expect these reactive, sophisticated apps,” Confluent’s head of technology strategy Andrew Sellers told me. “Streaming can take the mess of siloed enterprise data, and these data augmentation patterns, where you can stage wherever the data is and make it accessible for some RAG inference.”
Why data streaming was always around the corner
Streaming itself is based on the premise that data becomes available immediately once it is collected. It has typically been reserved for more advanced use cases like manufacturing (and, by extension, IoT), security, or financial services. But the prospect of all data pipelines becoming streaming pipelines has always been a tantalizing one, even if it isn’t strictly necessary for most companies.
Streaming data pipelines, though, carry a lot of additional data management baggage compared to batch pipelines, even if the batches are tiny. For one, there’s a major challenge in orchestrating the movement of that information and routing it to the right place to make it accessible in the first place.
“I think this is less of a database or data repository challenge than it is about not being able to get the sources of information to find their way into the repository you want,” Verma said. “I don’t think it’s as much a tech issue on the enterprise side as it is a sourcing issue.”