27 big data startups I'm watching closely as AI inches toward production
The tools built for managing the data under AI are just as important as the models themselves.
I’ve written before about how AI is, at its core, a story about how data moves into and out of these tools.
There’s training data, data used in customization and fine-tuning, data used in retrieval engines coming from whatever type of store you have, and the actual words and numbers that go into a prompt in the first place. The value of any given AI tool relies heavily on the quality of the data that goes into it. And data within larger enterprises inevitably carries an enormous amount of baggage around what can be used, where, when, and under what conditions.
In the early 2020s, we saw the emergence of the modern data stack, with dozens of startups attempting to essentially address this problem for classic machine learning applications—the churn models and recommendation engines of the world. That’s certainly changed with the advent of modern AI in the form of language and diffusion models.
But what we’re seeing over and over again (and I’ve probably written the same phrase a half-dozen times at this point) is that those same problems from classic machine learning apply directly to modern AI. What’s old is new again, and how that data flows and is managed is just as important as the models themselves as we get closer to a phase where all this stuff finally goes live.
Late last year, I published an extensive list of nearly 30 companies that I was following closely as OpenAI had its meltdown and companies got closer to getting all these tools out the door. So, as I said I would in that post, I’m putting out a list of all the big data companies that sit underneath (or adjacent to) all that AI rigmarole. These companies are a bit more mature, but the problems they address are just as important as, if not more important than, all the cool stuff showing up on Hugging Face these days.
The interesting part of all this is that the long-expected consolidation of the modern data stack, and of all these companies, hasn’t really happened yet. Databricks, Snowflake, and MongoDB have all expanded their product suites, but they’ve also still partnered closely with many of these companies. I suspect that’s because a broad revisiting of the importance of the data layer underneath AI is injecting renewed life into all of the work that’s been done between the “Cambrian explosion,” a term coined by dbt Labs CEO Tristan Handy, and the emergence of ChatGPT.
Before we get started on the list, though, I wanted to add a startup that I somehow forgot to my last list of AI companies, making it an even 30: Ideogram. Ideogram offers a really unique and interesting competitor to Midjourney and DALL-E, if only because it figured out how to generate text properly in the first place. It seems a little silly, but for a long time this seemed like a bizarre problem that image-generation tools just could not crack. That’s changed, but the fact that Ideogram was able to get to it first means there’s likely more happening behind the scenes than appears at face value.
Again, this list certainly doesn’t cover everything and excludes some of the (even more) established candidates. The modern data stack is a massive daisy chain that’s only grown longer over time, and it’s pretty tough to keep track of everything.
With that, let’s get to the actual list of big data startups!
27 big data startups that are on my radar as companies get closer to AI in production
Redpanda: With the advent of retrieval augmented generation as a general solution for hallucinations, some focus has shifted back to real-time data pipelines that go beyond standard use cases. This kind of “live RAG” architecture is starting to get more interest and potentially offers a path toward more formal adoption of streaming data.
Select Star: Select Star (a pun on a SQL query) is one of the buzzier emerging startups around data discoverability and governance. And as we’ve said before at least a million times at this point, data governance tooling is actually offering a more formal pathway toward deploying language models.
Neo4j: This isn’t specific to Neo4j, but there’s a growing and interesting opportunity for graph databases more broadly, which is what Neo4j specializes in. What if you could layer a graph on top of a vector database, linking embedded chunks of data together so the relationships between chunks are preserved? While graph databases have largely been relegated to niche use cases, the growing focus on embeddings is starting to make them, in principle, more interesting.
Fiddler: Datadog has largely snapped up the classic data observability market, but at the moment hasn’t made a significant move into AI. That’s essentially allowed Fiddler to become an early mover in AI observability, another key data governance tool necessary for getting all these tools into production.
Temporal: While LangChain and LlamaIndex were largely pegged as the LLM orchestration tools, no startup comes up more often in that realm than Temporal. Model development and maintenance in production is essentially a loop rather than a one-way pathway, and Temporal has an interesting shot at becoming a de facto orchestration tool for LLMs that are constantly updated through fine-tuning, RAG, and potentially LoRA adapter swapping (a technique laid out by Predibase that is gaining traction).
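To make the graph-on-top-of-vectors idea from the Neo4j entry a little more concrete, here is a minimal sketch in plain Python. It is not Neo4j's API or any vendor's implementation: the ChunkGraphIndex class, its method names, and the toy two-dimensional vectors are all hypothetical, and a real setup would use an actual embedding model and a real graph store. It just shows the shape of the idea, which is to find the closest chunk by vector similarity and then follow graph edges to pull in the chunks related to it.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ChunkGraphIndex:
    """Hypothetical toy index: vectors for similarity, edges for relationships."""

    def __init__(self):
        self.vectors = {}              # chunk_id -> embedding (list of floats)
        self.edges = defaultdict(set)  # chunk_id -> ids of related chunks

    def add_chunk(self, chunk_id, vector):
        self.vectors[chunk_id] = vector

    def link(self, a, b):
        # Undirected edge, e.g. "these two chunks came from the same section".
        self.edges[a].add(b)
        self.edges[b].add(a)

    def search(self, query_vector, top_k=1, hops=1):
        # Step 1: plain vector search, ranking every chunk by similarity.
        ranked = sorted(
            self.vectors,
            key=lambda cid: cosine(query_vector, self.vectors[cid]),
            reverse=True,
        )
        results = ranked[:top_k]
        seen = set(results)
        frontier = list(results)
        # Step 2: expand along graph edges to recover related chunks.
        for _ in range(hops):
            next_frontier = []
            for cid in frontier:
                for neighbor in sorted(self.edges[cid]):
                    if neighbor not in seen:
                        seen.add(neighbor)
                        results.append(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return results


# Toy data: made-up vectors standing in for real embeddings.
index = ChunkGraphIndex()
index.add_chunk("intro", [1.0, 0.0])
index.add_chunk("pricing", [0.0, 1.0])
index.add_chunk("terms", [0.1, 0.2])
index.link("pricing", "terms")  # assumed relationship between the two chunks

with_graph = index.search([0.0, 1.0], top_k=1, hops=1)
vector_only = index.search([0.0, 1.0], top_k=1, hops=0)
print(with_graph)
print(vector_only)
```

With hops set to 0 this is ordinary nearest-neighbor retrieval. The graph only changes which extra chunks ride along with the best match, which is the part a vector store on its own can't tell you.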