AI in August: RBAC is back, data as a product, and something about a bubble
The data engineers are more important than ever these days.
The general consensus from sources and experts I talk to is that the universal truth of summer in tech and venture capital held this year: not a lot happened and everyone was on vacation.
That’s starting to pick up as we head into the fall, both with a lot of stock prices flying in opposite directions and earnings reports wrapping up. As teams start to trickle back into the office (home or otherwise), we’re starting to come up on the whole “AI in prod” mental deadline that showed up in a lot of companies I’ve spoken with over the last few months. The failed projects are going in the trash, and the stuff that’s useful is working its way into roadmaps.
That kind of “nearness” to production, as usual, is showing up as more friction and barriers to getting things out the door. In the past few months, that’s centered again on enterprise governance requirements and some other kinds of technical walls, along with a kind of pragmatic shift in how a lot of these companies think about their team structure.
And a lot of arguing about bubbles, which, as usual, is a much more complicated and nuanced situation than it seems at face value. So, with all that said, here’s what’s coming up as we head into the fall:
Role-based access control has entered the chat. Developers and industry executives have added RBAC (pronounced are-back), the need to partition out who can get access to what data through a language model, to the very large stack of stuff that needs to get done before something enters production. This is coming up mostly in the context of internal chatbots, but there’s more to it under the hood.
The data engineers are now under the spotlight. AI (generative or otherwise) is quickly being recognized as a data-centric product. The data engineering teams, traditionally joined at the hip with analysts, now seem to be getting a lot more exposure at larger organizations.
It has been (0) days since we have said AI is a bubble. After more than a year of talking about whether AI is in a bubble, we are again… talking about whether AI is in a bubble. We’ll go over this again, but the answer is that it’s complicated.
The RBAC wall to get into production
One of the biggest potential problems most companies that I talk to are fretting about is whether some random employee will get access to sensitive data that they aren’t allowed to see. This isn’t restricted to PII or anything like that. It could even be someone who isn’t on an HR team getting access to salary information they aren’t supposed to be able to view.
This potential snafu has a lot of names within companies, though most people I talk to throw the term “leakage” on it. As a result, there’s some skittishness around whether to basically feed all of a company’s data into a custom model (fine-tuning or otherwise), slap some guardrails on it, and hope for the best.
Well, that’s not the first step companies usually take when they are looking at deploying something custom that taps company data. Instead, that company data is embedded into a database (such as a vector database, Postgres, or MongoDB) and a model can fetch it for a prompt through a process called retrieval-augmented generation, or RAG. There are varying levels of sophistication to it, with some companies also using it in conjunction with graph search.
Each data point you’d want to retrieve is embedded in a “chunk,” or some block of information. The chunks can vary in size and format, ranging from just sentences in an email to full documents, and all that data has to be pre-processed in some form to even get to a point where a developer can embed it and make it available through RAG. But the list of governance requirements is continuously growing as companies get more serious about putting tools on top of language models into production, and determining who can access each chunk of information is now part of the set of requirements.
This is where role-based access control comes into play by assigning a kind of insulation layer around data points to determine who gets access to what. Some companies have already been trying to tackle the RBAC problem since earlier this year, but the momentum for it seems to generally be picking up among companies and investors I talk to—likely as a byproduct of these companies trying to get these products based on language models out the door.
“We went from a world where we were focused on data loaders in early 2023, to a world now you have to think about RBAC and bounding boxes so you have traceability and can show your homework,” Brian Raymond, CEO of unstructured data ETL startup Unstructured.io, told me. “Regardless of what’s in that manilla envelope of data that gets written to a vector database, you have to have timestamps, owner, which group it belonged to for RBAC, version history, and a lot of other information. Any time you’re doing more than a proof-of-concept, it is a complete blocker.”
As with the example above, this mostly comes up when talking about companies building internal chatbots that replace sprawling Wikis that are increasingly inaccessible. But it’s pretty easy to see how it extends out to potential end users for even a customer-facing product—after all, an employee is a customer of an internal-facing product.
“We generate around 30 types of metadata during preprocessing, and that’s critical when you’re doing retrieval,” Raymond told me. “And we’re transposing the requirements data engineering teams have been developing over the last ten years onto the generative AI data stack.”
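To make the idea concrete, here is a minimal sketch of what an RBAC check at retrieval time could look like. It assumes each chunk carries the kind of metadata Raymond describes (owner, group, timestamp, version) and that access is filtered before any similarity ranking happens. Every name here (Chunk, allowed_groups, retrieve), the toy two-dimensional embeddings, and the sample date are hypothetical and for illustration only. This is not how Unstructured.io or any particular vector database implements it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import math


@dataclass(frozen=True)
class Chunk:
    """One retrievable block of text plus hypothetical governance metadata."""
    text: str
    embedding: tuple[float, ...]
    owner: str
    allowed_groups: frozenset[str]  # groups permitted to read this chunk
    created_at: datetime
    version: int = 1


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, chunks, user_groups, k=3):
    """Drop chunks the user cannot see, then rank the rest by similarity."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    ranked = sorted(
        visible,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy data: the embeddings and the timestamp are made up for illustration.
NOW = datetime(2024, 8, 1, tzinfo=timezone.utc)
CHUNKS = [
    Chunk("Salary bands for 2024", (0.9, 0.1), "hr-team", frozenset({"hr"}), NOW),
    Chunk("How to request vacation", (0.7, 0.3), "hr-team",
          frozenset({"hr", "all-staff"}), NOW),
    Chunk("Deploy checklist", (0.1, 0.9), "platform",
          frozenset({"engineering"}), NOW),
]

if __name__ == "__main__":
    # An engineer asking an HR-flavored question never sees the salary chunk.
    for chunk in retrieve((1.0, 0.0), CHUNKS, {"all-staff", "engineering"}):
        print(chunk.text)
```

The design choice worth noticing is that the filter runs before ranking, so a restricted chunk never reaches the prompt at all, rather than relying on a guardrail to catch it after the model has already seen it.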
Again, guardrailing is one way to do it, but most developers and companies I talk to are increasingly recognizing that they’ll need something more sophisticated in place. And that leads to another emerging problem in AI, which is…
Someone please check on the data engineers
Each of these companies is also increasingly realizing that the problem of getting a generative AI product into production is… the same problem as getting a regular data-centric product into production: high-quality proprietary data.
Data engineers have traditionally built out pipelines that feed into more “classic” workflows like analytics. The emergence of dbt and the whole analytics engineering role started to extend the responsibilities of analysts to handle more data engineering-oriented tasks. But there are two emerging trends within all these companies that are actually trying to move past a proof-of-concept into something that can generate a real return:
Many companies are tasking software engineers with building all of these AI-powered tools, and those engineers have to quickly grapple with the role high-fidelity data plays in all of these tools.