
Generative AI in the Real World: The LLMOps Shift with Abi Aryan

20 November 2025 at 07:16

MLOps is dead. Well, not really, but for many the job is evolving into LLMOps. In this episode, Abide AI founder and LLMOps author Abi Aryan joins Ben to discuss what LLMOps is and why it’s needed, particularly for agentic AI systems. Listen in to hear why LLMOps requires a new way of thinking about observability, why we should spend more time understanding human workflows before mimicking them with agents, how to do FinOps in the age of generative AI, and more.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: All right, so today we have Abi Aryan. She is the author of the O’Reilly book on LLMOps as well as the founder of Abide AI. So, Abi, welcome to the podcast. 

00.19: Thank you so much, Ben. 

00.21: All right. Let’s start with the book, which I confess, I just cracked open: LLMOps. People probably listening to this have heard of MLOps. So at a high level, the models have changed: They’re bigger, they’re generative, and so on and so forth. So since you’ve written this book, have you seen a wider acceptance of the need for LLMOps? 

00.51: I think more recently there are more infrastructure companies. So there was a conference happening recently, and there was this sort of perception or messaging across the conference, which was “MLOps is dead.” Although I don’t agree with that. 

There’s a big difference that companies have started to pick up on more recently, as the infrastructure around the space has sort of started to improve. They’re starting to realize how different the pipelines were that people managed and grew, especially for the older companies like Snorkel that were in this space for years and years before large language models came in. The way they were handling data pipelines—and even the observability platforms that we’re seeing today—have changed tremendously.

01.40: What about, Abi, the general. . .? We don’t have to go into specific tools, but we can if you want. But, you know, if you look at the old MLOps person and then fast-forward, this person is now an LLMOps person. So on a day-to-day basis [has] their suite of tools changed? 

02.01: Massively. I think for an MLOps person, the focus was very much around “This is my model. How do I containerize my model, and how do I put it in production?” That was the entire problem and, you know, most of the work was around “Can I containerize it? What are the best practices around how I arrange my repository? Are we using templates?” 

Drawbacks happened, but not as much, because most of the time the stuff was tested and there was not too much nondeterministic behavior within the models themselves. Now that has changed.

02.38: [For] most of the LLMOps engineers, the biggest job right now is doing FinOps really, which is controlling the cost because the models are massive. The second thing, which has been a big difference, is we have shifted from “How can we build systems?” to “How can we build systems that can perform, and not just perform technically but perform behaviorally as well?”: “What is the cost of the model? But also what is the latency? And what’s the throughput looking like? How are we managing the memory across different tasks?” 

The problem has really shifted when we talk about it. . . So a lot of the focus for MLOps was “Let’s create fantastic dashboards that can do everything.” Right now, no matter which dashboard you create, the monitoring is really very dynamic. 

03.32: Yeah, yeah. As you were talking there, you know, I started thinking, yeah, of course, obviously now the inference is essentially a distributed computing problem, right? So that was not the case before. Now you have different phases even of the computation during inference, so you have the prefill phase and the decode phase. And then you might need different setups for those. 

So anecdotally, Abi, did the people who were MLOps people successfully migrate themselves? Were they able to upskill themselves to become LLMOps engineers?

04.14: I know a couple of friends who were MLOps engineers. They were teaching MLOps as well—Databricks folks, MVPs. And they’re now transitioning to LLMOps.

But the way they started is by focusing very much on “Can you do evals for these models?” They weren’t really dealing with the infrastructure side of it yet. And that was their slow transition. And right now they’re very much at that point where they’re thinking, “OK, can we make it easy to just catch these problems within the model inferencing itself?”

04.49: A lot of other problems still stay unsolved. Then the other side, which was like a lot of software engineers who entered the field and became AI engineers, they have a much easier transition because software. . . The way I look at large language models is not just as another machine learning model but literally like software 3.0 in that way, which is it’s an end-to-end system that will run independently.

Now, the model isn’t just something you plug in. The model is the product, really. So for those people, most software is built around these ideas, which is, you know, we need strong cohesion. We need low coupling. We need to think about “How are we doing microservices? How does the communication happen between the different tools that we’re using? How are we calling our endpoints? How are we securing our endpoints?”

Those questions come easier. So the system design side of things comes easier to people who work in traditional software engineering. So the transition has been a little bit easier for them as compared to people who were traditionally like MLOps engineers. 

05.59: And hopefully your book will help some of these MLOps people upskill themselves into this new world.

Let’s pivot quickly to agents. Obviously it’s a buzzword. Just like anything in the space, it means different things to different teams. So how do you distinguish agentic systems yourself?

06.24: There are two words in the space. One is agents; one is agent workflows. Basically agents are the components really. Or you can call them the model itself, but they’re trying to figure out what you meant, even if you forgot to tell them. That’s the core work of an agent. And the work of a workflow or the workflow of an agentic system, if you want to call it, is to tell these agents what to actually do. So one is responsible for execution; the other is responsible for the planning side of things. 
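
As an aside, a minimal sketch of that planner/executor split might look like the following. The names here (plan, Agent, run_workflow) are purely illustrative assumptions, not any particular framework’s API.

```python
# Minimal sketch: a planning function decides WHAT to do; executor agents do it.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Agent:
    """An executor: given a concrete step, it figures out how to get it done."""
    name: str
    execute: Callable[[str], str]


def plan(goal: str) -> List[str]:
    """The workflow/planning side: decide what should be done.
    In a real system this might itself be an LLM call; here it is a stub."""
    return [f"research: {goal}", f"draft: {goal}", f"review: {goal}"]


def run_workflow(goal: str, agents: Dict[str, Agent]) -> List[str]:
    """Route each planned step to the agent responsible for executing it."""
    results = []
    for step in plan(goal):
        role = step.split(":", 1)[0]
        results.append(agents[role].execute(step))
    return results


if __name__ == "__main__":
    agents = {
        "research": Agent("researcher", lambda s: f"[notes for] {s}"),
        "draft": Agent("writer", lambda s: f"[draft for] {s}"),
        "review": Agent("reviewer", lambda s: f"[review of] {s}"),
    }
    print(run_workflow("quarterly report", agents))
```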

07.02: I think sometimes when tech journalists write about these things, the general public gets the notion that there’s this monolithic model that does everything. But the reality is, most teams are moving away from that design, as you describe.

So they have an agent that acts as an orchestrator or planner and then parcels out the different steps or tasks needed, and then maybe reassembles in the end, right?

07.42: Coming back to your point, it’s now less of a problem of machine learning. It’s, again, more like a distributed systems problem because we have multiple agents. Some of these agents will have more load—they will be the frontend agents, which are communicating to a lot of people. Obviously, on the GPUs, these need more distribution.

08.02: And when it comes to the other agents that may not be used as much, they can be provisioned based on “This is the need, and this is the availability that we have.” So all of that provisioning again is a problem. The communication is a problem. Setting up tests across different tasks itself within an entire workflow, now that becomes a problem, which is where a lot of people are trying to implement context engineering. But it’s a very complicated problem to solve. 

08.31: And then, Abi, there’s also the problem of compounding reliability. Let’s say, for example, you have an agentic workflow where one agent passes off to another agent and then to yet another third agent. Each agent may have a certain amount of reliability, but the errors compound across this pipeline, which makes it more challenging. 
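
To make the compounding concrete, here is a quick back-of-the-envelope sketch with made-up numbers: if each hop in the pipeline succeeds 95% of the time and failures are independent, end-to-end reliability drops quickly.

```python
# Back-of-the-envelope math for compounding reliability across a pipeline
# (hypothetical numbers, assuming each agent fails independently).
per_agent_success = 0.95

for n_agents in (1, 3, 5):
    end_to_end = per_agent_success ** n_agents
    print(f"{n_agents} agent(s): {end_to_end:.1%} end-to-end success")

# 1 agent: 95.0%   3 agents: ~85.7%   5 agents: ~77.4%
```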

09.02: And that’s where there’s a lot of research work going on in the space. It’s an idea that I’ve talked about in the book as well. At the point when I was writing the book, especially chapter four, in which a lot of these ideas were described, most of the companies right now are [using] a monolithic architecture, but that’s not going to be sustainable as we go towards applications.

We have to go towards a microservices architecture. And the moment we go towards microservices architecture, there are a lot of problems. One will be the hardware problem. The other is consensus building, which is. . . 

Let’s say you have three different agents spread across three different nodes, which would be running very differently. Let’s say one is running on an H100; one is running on something else. How can we achieve consensus on which of the nodes ends up winning? So that’s open research work [where] people are trying to figure out, “Can we achieve consensus in agents based on whatever answer the majority is giving, or how do we really think about it?” Should it be set up at a threshold where, if agreement is beyond this threshold, then, you know, it works?

One of the frameworks that is trying to work in this space is called MassGen—they’re working on the research side of solving this problem within the tool itself. 
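
One simple way to frame the consensus question is threshold-based majority voting across agent answers. The sketch below is illustrative only and says nothing about how MassGen or any specific framework actually implements consensus.

```python
# Minimal sketch: accept the majority answer only if it clears a vote threshold;
# otherwise escalate, retry, or fall back to a human.
from collections import Counter
from typing import Optional, Sequence


def majority_consensus(answers: Sequence[str], threshold: float = 0.66) -> Optional[str]:
    """Return the majority answer if its vote share clears the threshold, else None."""
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes / len(answers) >= threshold else None


print(majority_consensus(["A", "A", "B"]))   # "A" (2/3 clears the 0.66 threshold)
print(majority_consensus(["A", "B", "C"]))   # None -> no consensus reached
```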

10.31: By the way, even back in the microservices days in software architecture, obviously people went overboard too. So I think that, as with any of these new things, there’s a bit of trial and error that you have to go through. And the better you can test your systems and have a setup where you can reproduce and try different things, the better off you are, because many times your first stab at designing your system may not be the right one. Right? 

11.08: Yeah. And I’ll give you two examples of this. So AI companies tried to use a lot of agentic frameworks. You know people have used Crew; people have used n8n, they’ve used. . . 

11.25: Oh, I hate those! Not hate. . . Sorry. Sorry, my friends at Crew. 

11.30: And 90% of the people working in this space seriously have already made that transition, which is “We are going to write it ourselves.”

The same happened for evaluation: There were a lot of evaluation tools out there. What they were doing on the surface is literally just tracing, and tracing wasn’t really solving the problem—it was just a beautiful dashboard that doesn’t really serve much purpose. Maybe for the business teams. But at least for the ML engineers who are supposed to debug these problems and, you know, optimize these systems, essentially, it was not giving much other than “What is the error response that we’re getting to everything?”

12.08: So again, for that one as well, most of the companies have developed their own evaluation frameworks in-house as of now. The people who are just starting out obviously haven’t done that yet. But most of the companies that started working with large language models in 2023, they’ve tried every tool out there in 2023, 2024. And right now more and more people are staying away from the frameworks and launching their own.

People have understood that most of the frameworks in this space are not superreliable.

12.41: And [are] also, honestly, a bit bloated. They come with too many things that you don’t need in many ways. . .

12.54: Security loopholes as well. For example, I reported one of the security loopholes with LangChain, with LangSmith, back in 2024. Those things obviously get reported by people [and] get worked on, but the companies aren’t really proactively working on closing those security loopholes. 

13.15: Two open source projects that I like that are not specifically agentic are DSPy and BAML. Wanted to give them a shout out. So this point I’m about to make, there’s no easy, clear-cut answer. But one thing I noticed, Abi, is that people will do the following, right? I’m going to take something we do, and I’m going to build agents to do the same thing. But the way we do things is I have a—I’m just making this up—I have a project manager and then I have a designer, I have role B, role C, and then there’s certain emails being exchanged.

So then the first step is “Let’s replicate not just the roles but kind of the exchange and communication.” And sometimes that actually increases the complexity of the design of your system because maybe you don’t need to do it the way the humans do it. Right? Maybe if you go to automation and agents, you don’t have to over-anthropomorphize your workflow. Right. So what do you think about this observation? 

14.31: A very interesting analogy I’ll give you is people are trying to replicate intelligence without understanding what intelligence is. The same for consciousness. Everybody wants to replicate and create consciousness without understanding consciousness. So the same is happening with this as well, which is we are trying to replicate a human workflow without really understanding how humans work.

14.55: And sometimes humans may not be the most efficient thing. Like they exchange five emails to arrive at something. 

15.04: And humans are never context defined in such a limiting sense. Even if somebody’s job is to do editing, they’re not just doing editing. They are looking at the flow. They are looking for a lot of things which you can’t really define. Obviously you can over a period of time, but it needs a lot of observation to understand. And that skill also depends on who the person is. Different people have different skills as well. Most of the agentic systems right now, they’re just glorified Zapier IFTTT routines. That’s the way I look at them right now. The IFTTT recipes: If this, then that.

15.48: Yeah, yeah. Robotic process automation I guess is what people call it. The other thing that I don’t think people understand just from reading the popular tech press is that agents have levels of autonomy, right? Most teams don’t actually build an agent and unleash it fully autonomous from day one.

I mean, I guess the analogy would be in self-driving cars: They have different levels of automation. Most enterprise AI teams realize that with agents, you have to kind of treat them that way too, depending on the complexity and the importance of the workflow. 

So at first a human is very much involved, and then less and less so over time as you develop confidence in the agent.

But I think it’s not good practice to just kind of let an agent run wild. Especially right now. 

16.56: It’s not, because who’s the person answering if the agent goes wrong? And that’s a question that has come up often. So this is the work that we’re doing at Abide really, which is trying to create a decision layer on top of the knowledge retrieval layer.

17.07: Most of the agents which are built using just large language models. . . LLMs—I think people need to understand this part—are fantastic at knowledge retrieval, but they do not know how to make decisions. If you think agents are independent decision makers and they can figure things out, no, they cannot figure things out. They can look at the database and try to do something.

Now, what they do may or may not be what you like, no matter how many rules you define across that. So what we really need to develop is some sort of symbolic language around how these agents are working, which is more like trying to give them a model of the world around “What is the cause and effect of all of these decisions that you’re making? How do we prioritize one decision over the. . .? What was the reasoning behind that?” That entire decision-making reasoning has been the missing part.

18.02: You brought up the topic of observability. There’s two schools of thought here as far as agentic observability. The first one is we don’t need new tools. We have the tools. We just have to apply [them] to agents. And then the second, of course, is this is a new situation. So now we need to be able to do more. . . The observability tools have to be more capable because we’re dealing with nondeterministic systems.

And so maybe we need to capture more information along the way. Chains of decision, reasoning, traceability, and so on and so forth. Where do you fall in this kind of spectrum of we don’t need new tools or we need new tools? 

18.48: We don’t need new tools, but we certainly need new frameworks, and especially a new way of thinking. Observability in the MLOps world—fantastic; it was just about tools. Now, people have to stop thinking about observability as just visibility into the system and start thinking of it as an anomaly detection problem. And that was something I’d written in the book as well. Now it’s no longer about “Can I see what my token length is?” No, that’s not enough. You have to look for anomalies at every single layer, across a lot of metrics. 

19.24: So your position is we can use the existing tools. We may have to log more things. 

19.33: We may have to log more things, and then start building simple ML models to be able to do anomaly detection. 

Think of managing any machine, any LLM model, any agent as really like a fraud detection pipeline. So every single time you’re looking for “What are the simplest signs of fraud?” And that can happen across various factors. But we need more logging. And again you don’t need external tools for that. You can set up your own loggers as well.

Most of the people I know have been setting up their own loggers within their companies. So you can simply use telemetry to a) define and use the general logs, and b) define your own custom logs as well, depending on your agent pipeline itself. You can define “This is what it’s trying to do” and log more things across those steps, and then start building small machine learning models to look for what’s going on over there.
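
As one possible shape for that “log your own features, then flag outliers” loop, here is a minimal sketch, assuming scikit-learn is available; the feature names, distributions, and thresholds are made-up illustrations rather than recommended values.

```python
# Minimal sketch of "observability as anomaly detection": log simple per-request
# features yourself, fit a small model on normal traffic, flag outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Pretend these rows came from your own telemetry/loggers:
# columns = [latency_seconds, prompt_tokens, completion_tokens, tool_calls]
normal_traffic = np.column_stack([
    rng.normal(1.2, 0.3, 2000),      # latency
    rng.normal(800, 150, 2000),      # prompt tokens
    rng.normal(300, 80, 2000),       # completion tokens
    rng.poisson(2, 2000),            # tool calls per request
])

detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(normal_traffic)

# A suspicious request: slow, huge prompt, runaway tool-calling loop.
suspicious = np.array([[9.5, 6000, 2500, 40]])
print(detector.predict(suspicious))   # -1 means "anomaly: investigate"
```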

20.36: So what is the state of “Where are we? How many teams are doing this?” 

20.42: Very few. Very, very few. Maybe just the top teams. The ones who are doing reinforcement learning training and using RL environments, because that’s where they’re getting their data to do RL. But people who are not using RL to be able to retrain their model, they’re not really doing much of this part; they’re still depending very much on external tools.

21.12: I’ll get back to RL in a second. But one topic you raised when you pointed out the transition from MLOps to LLMOps was the importance of FinOps, which is, for our listeners, basically managing your cloud computing costs—or in this case, increasingly mastering token economics. Because basically, it’s one of these things that I think can bite you.

For example, the first time you use Claude Code, you go, “Oh, man, this tool is powerful.” And then boom, you get an email with a bill. I see, that’s why it’s powerful. And you multiply that across the board to teams who are starting to maybe deploy some of these things. And you see the importance of FinOps.

So where are we, Abi, as far as tooling for FinOps in the age of generative AI and also the practice of FinOps in the age of generative AI? 

22.19: Less than 5%, maybe even 2% of the way there. 

22.24: Really? But obviously everyone’s aware of it, right? Because at some point, when you deploy, you become aware. 

22.33: Not enough people. A lot of people just think about FinOps as cloud, basically the cloud cost. And there are different kinds of costs in the cloud. One of the things people are not doing enough is profiling their models properly, which is [determining] “Where are the costs really coming from? Is it our models’ compute? Are they taking too much RAM?”

22.58: Or are we using reasoning when we don’t need it?

23.00: Exactly. Now that’s a problem we solve very differently. That’s where, yes, you can do kernel fusion and define your own custom kernels. Right now there’s a massive number of people who think we need to rewrite kernels for everything. That’s only going to solve one problem, which is the compute-bound problem. It’s not going to solve the memory-bound problem. It’s your data engineering pipelines that are going to solve your memory-bound problems.

And that’s where most of the focus is missing. I’ve mentioned it in the book as well: Data engineering is the foundation for first being able to solve the memory-bound problems; then we move to the compute-bound problems. Do not start by optimizing the kernels. And then the third part would be the communication-bound problem, which is “How do we make these GPUs talk smarter with each other? How do we figure out the agent consensus and all of those problems?”

Now that’s a communication problem. And that’s what happens when there are different levels of bandwidth. Everybody’s dealing with internet bandwidth as well, serving speed as well, different kinds of cost for every kind of transition from one node to another. If we’re not really hosting our own infrastructure, then that’s a different problem, because it depends on “Which server do you get assigned your GPUs on again?”
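
One rough way to make the compute-bound versus memory-bound distinction concrete is a roofline-style comparison of a workload’s arithmetic intensity against the hardware’s FLOPs-to-bandwidth ratio. The numbers below are illustrative assumptions, not measured specs for any real accelerator.

```python
# Roofline-style sanity check: is a workload compute-bound or memory-bound?
# All hardware numbers and workload intensities here are illustrative guesses.

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

# Hypothetical accelerator: 300 TFLOP/s of compute, 2 TB/s of memory bandwidth.
hw_flops_per_byte = 300e12 / 2e12   # = 150 FLOPs per byte

# LLM decode reads every weight once per token (~2 bytes at fp16) and does
# roughly 2 FLOPs per weight, so its intensity is around 1 FLOP/byte.
workloads = {
    "decode (one token at a time)": arithmetic_intensity(2.0, 2.0),
    "large batched prefill": 400.0,   # assumed value, for contrast
}

for name, intensity in workloads.items():
    verdict = "memory-bound" if intensity < hw_flops_per_byte else "compute-bound"
    print(f"{name}: ~{intensity:.0f} FLOPs/byte -> {verdict}")
```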

24.20: Yeah, yeah, yeah. I want to give a shout out to Ray—I’m an advisor to Anyscale—because Ray basically is built for these sorts of pipelines because it can do fine-grained utilization and help you decide between CPU and GPU. And just generally, you don’t think that the teams are taking token economics seriously?

I guess not. How many people have I heard talking about caching, for example? Because if it’s a prompt that [has been] answered before, why do you have to go through it again? 

25.07: I think plenty of people have started implementing KV caching, but they don’t really know. . . Again, one of the questions people don’t understand is “How much do we need to store in the memory itself, and how much do we need to store in the cache?” which is the big memory question. So that’s the one I don’t think people are able to solve. A lot of people are storing too much stuff in the cache that should actually be stored in the RAM itself, in the memory.

And there are generalist applications that don’t really understand that this agent doesn’t really need access to the memory. There’s no point; it’s just lost throughput, really. So I think the problem isn’t really caching. The problem is that differentiation of understanding for people. 
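
As a small illustration of the caching point, here is a minimal exact-match response cache keyed on a prompt hash; call_model is a hypothetical stand-in for whatever client you use, and this is separate from the KV cache that lives inside the serving engine itself.

```python
# Minimal sketch: skip a paid model call when the exact prompt was seen before.
import hashlib
from typing import Callable, Dict


class ResponseCache:
    def __init__(self, call_model: Callable[[str], str]):
        self.call_model = call_model
        self._store: Dict[str, str] = {}

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._store:                 # cache miss -> pay for a call
            self._store[key] = self.call_model(prompt)
        return self._store[key]                    # cache hit -> free


cache = ResponseCache(call_model=lambda p: f"[model answer to] {p}")
cache.complete("What is our refund policy?")       # calls the model
cache.complete("What is our refund policy?")       # served from the cache
```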

25.55: Yeah, yeah, I just threw that out as one element. Because obviously there are many, many things to mastering token economics. So you brought up reinforcement learning. A few years ago, obviously people got really into “Let’s do fine-tuning.” But then they quickly realized. . . And actually fine-tuning became easy because basically there were so many services where you can just focus on labeled data. You upload your labeled data, boom, come back from lunch, you have a fine-tuned model.

But then people realize that “I fine-tuned, but the model that results isn’t really as good as my fine-tuning data.” And then obviously RAG and context engineering came into the picture. Now it seems like more people are again talking about reinforcement learning, but in the context of LLMs. And there’s a lot of libraries, many of them built on Ray, for example. But it seems like what’s missing, Abi, is that fine-tuning got to the point where I can sit down a domain expert and say, “Produce labeled data.” And basically the domain expert is a first-class participant in fine-tuning.

As best I can tell, for reinforcement learning, the tools aren’t there yet. The UX hasn’t been figured out in order to bring in the domain experts as the first-class citizen in the reinforcement learning process—which they need to be because a lot of the stuff really resides in their brain. 

27.45: The big problem here, and very, very much to the point of what you pointed out, is the tools aren’t really there. And one very specific thing I can tell you is most of the reinforcement learning environments that you’re seeing are static environments. Agents are not learning statically. They are learning dynamically. If your RL environment cannot adapt dynamically, that’s a problem. Most of these environments basically emerged around 2018, 2019, with OpenAI Gym, when a lot of reinforcement learning libraries were coming out.

28.18: There is a line of work called curriculum learning, which is basically adapting the training difficulty to the model’s results. So basically now that can be used in reinforcement learning, but I’ve not seen any practical implementation of using curriculum learning for reinforcement learning environments. So people create these environments—fantastic. They work well for a little bit of time, and then they become useless.
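
A minimal sketch of that idea, with arbitrary thresholds and step sizes chosen purely for illustration: raise task difficulty when the agent’s rolling success rate is high, and back off when it is low.

```python
# Minimal curriculum-learning sketch: nudge environment difficulty based on
# the agent's recent success rate. Numbers are arbitrary assumptions.
from collections import deque


class Curriculum:
    def __init__(self, difficulty: float = 0.1, window: int = 50):
        self.difficulty = difficulty          # e.g., task length, noise, distractors
        self.results = deque(maxlen=window)   # rolling record of pass/fail

    def record(self, success: bool) -> None:
        self.results.append(success)
        if len(self.results) < self.results.maxlen:
            return                             # wait for a full window first
        rate = sum(self.results) / len(self.results)
        if rate > 0.8:                         # too easy -> harder tasks
            self.difficulty = min(1.0, self.difficulty + 0.05)
        elif rate < 0.3:                       # too hard -> back off
            self.difficulty = max(0.0, self.difficulty - 0.05)


curriculum = Curriculum()
for episode_success in [True] * 60:            # the agent keeps winning...
    curriculum.record(episode_success)
print(curriculum.difficulty)                   # ...so difficulty has ramped up
```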

So that’s where even OpenAI, Anthropic, those companies are struggling as well. They’ve paid heavily in yearlong contracts to say, “Can you build this vertical environment? Can you build that vertical environment?” and that works fantastically. But once the model learns on it, then there’s nothing else to learn. And then you go back to the question of “Is this data fresh? Is this adaptive with the world?” And it becomes the same RAG problem over again. 

29.18: So maybe the problem is with RL itself. Maybe we need a different paradigm. It’s just too hard. 

Let me close by looking to the future. The first thing is—the space is moving so fast, this might be an impossible question to ask, but if you look at, let’s say, 6 to 18 months, what are some things in the research domain that you think are not being talked about enough that might produce enough practical utility that we will start hearing about them in 6 to 12, or 6 to 18, months?

29.55: One is how to profile your machine learning models, like the entire systems end-to-end. A lot of people do not understand them as systems, but only as models. So that’s one thing which will make a massive amount of difference. There are a lot of AI engineers today, but we don’t have enough system design engineers.

30.16: This is something that Ion Stoica at Sky Computing Lab has been giving keynotes about. Yeah. Interesting. 

30.23: The second part is. . . I’m optimistic about seeing curriculum learning applied to reinforcement learning as well, where our RL environments can adapt in real time so when we train agents on them, they are dynamically adapting as well. That’s also [some] of the work being done by labs like Sakana AI, which are working on artificial life and evolutionary approaches, all of that stuff—evolving any kind of machine learning model. 

30.57: The third thing where I feel like the communities are falling behind massively is on the data engineering side. That’s where we have massive gains to get. 

31.09: So on the data engineering side, I’m happy to say that I advise several companies in the space that are completely focused on tools for these new workloads and these new data types. 

Last question for our listeners: What mindset shift or what skill do they need to pick up in order to position themselves in their career for the next 18 to 24 months?

31.40: For anybody who’s an AI engineer, a machine learning engineer, an LLMOps engineer, or an MLOps engineer, first learn how to profile your models. Start picking up Ray very quickly as a tool to just get started on, to see how distributed systems work. You can pick vLLM if you want, but start understanding distributed systems first. And once you start understanding those systems, then start looking back into the models themselves. 

32.11: And with that, thank you, Abi.


Generative AI in the Real World: Chris Butler on GenAI in Product Management

30 October 2025 at 07:29

In this episode, Ben Lorica and Chris Butler, director of product operations for GitHub’s Synapse team, chat about the experimentation Chris is doing to incorporate generative AI into the product development process—particularly with the goal of reducing toil for cross-functional teams. It isn’t just automating busywork (although there’s some of that). He and his team have created agents that expose the right information at the right time, use feedback in meetings to develop “straw man” prototypes for the team to react to, and even offer critiques from specific perspectives (a CPO agent?). Very interesting stuff.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: Today we have Chris Butler of GitHub, where he leads a team called Synapse. Welcome to the podcast, Chris. 

00.15: Thank you. Yeah. Synapse is actually part of our product team and what we call EPD operations, which is engineering, product, and design. And our team is mostly engineers. I’m the product lead for it, but we help solve and reduce toil for these cross-functional teams inside of GitHub, mostly building internal tooling, with the focus on process automation and AI. But we also have a speculative part of our practice as well: trying to imagine the future of cross-functional teams working together and how they might do that with agents, for example.

00.45: Actually, you are the first person I’ve come across who’s used the word “toil.” Usually “tedium” is what people use, in terms of describing the parts of their job that they would rather automate. So you’re actually a big proponent of talking about agents that go beyond coding agents.

01.03: Yeah. That’s right. 

01.05: And specifically in your context for product people. 

01.09: And actually, for just the way that, say, product people work with their cross-functional teams. But I would also include other types of functions (legal, privacy, customer support, docs), any of these people that are working to actually help build a product; I think there needs to be a transformation of the way we think about these tools.

01.29: GitHub is a very engineering-led organization as well as a very engineering-focused organization. But my role is to really think about “How do we do a better job between all these people that I would call nontechnical—they are sometimes technical, of course, but they’re the people that are not necessarily there to write code. . . How do we actually work together to build great products?” And so that’s how I think about the work. 

01.48: For people who aren’t familiar with product management and product teams, what’s toil in the context of product teams? 

02.00: So toil is actually a term that I stole from Google SRE: any type of thing that someone has to do that is manual, tactical, repetitive. . . It usually doesn’t really add to the value of the product in any way. It’s something that scales linearly as the team gets bigger or the product moves down the SDLC or lifecycle, with the fact that you’re building bigger and bigger things. And so it’s usually something that we want to try to cut out, because not only is it potentially a waste of time, but there’s also a perception within the team that it can cause burnout.

02.35: If I have to constantly be doing toilsome parts of my work, I feel I’m doing things that don’t really matter rather than focusing on the things that really matter. And what I would argue is especially for product managers and cross-functional teams, a lot of the time that is processes that they have to use, usually to share information within larger organizations.

02.54: A good example of that is status reporting. Status reporting is one of those things where people will spend anywhere from 30 minutes to hours per week. And sometimes it’s in certain parts of the team—technical product managers, product managers, engineering managers, program managers are all dealing with this aspect that they have to in some way summarize the work that the team is doing and then shar[e] that not only with their leadership. . . They want to build trust with their leadership, that they’re making the right decisions, that they’re making the right calls. They’re able to escalate when they need help. But also then to convey information to other teams that are dependent on them or they’re dependent on. Again, this is [in] very large organizations, [where] there’s a huge cost to communication flows.

03.35: And so that’s why I use status reporting as a good example of that. Now with the use of the things like LLMs, especially if we think about our LLMs as a compression engine or a translation engine, we can then start to use these tools inside of these processes around status reporting to make it less toilsome. But there’s still aspects of it that we want to keep that are really about humans understanding, making decisions, things like that. 
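
To make the “compression engine” framing concrete, here is a minimal sketch, not GitHub’s actual tooling: pull recent issue activity over the public GitHub REST API and hand it to a summarizer for a first draft that a human still reviews and owns. The summarize callable, repo name, and token are placeholders.

```python
# Minimal sketch: draft a status report from recent issue activity.
import requests


def fetch_recent_issues(owner: str, repo: str, token: str) -> list:
    """Pull the 20 most recently updated issues from the GitHub REST API."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        headers={"Authorization": f"Bearer {token}"},
        params={"state": "all", "sort": "updated", "per_page": 20},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def draft_status_report(issues: list, summarize) -> str:
    """Compress issue titles into a leadership-facing draft; a human edits it."""
    bullets = "\n".join(f"- {i['title']} ({i['state']})" for i in issues)
    prompt = (
        "Draft a weekly status update for leadership from these issues. "
        "Call out risks and decisions needed:\n" + bullets
    )
    return summarize(prompt)   # hypothetical model call; the human owns the final report


# Hypothetical usage (requires a personal access token and a model client):
# issues = fetch_recent_issues("acme", "example-repo", token="<PAT>")
# print(draft_status_report(issues, summarize=my_model_call))
```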

03.59: And this is key. So one of the concerns that people have is about a hollowing out in the following context: If you eliminate toil in general, the problem there is that your most junior or entry-level employees actually learn about the culture of the organization by doing toil. There’s some level of toil that becomes part of the onboarding and the acculturation of young employees. But on the other hand, this is a challenge for organizations to just change how they onboard new employees, what kinds of tasks they give them, and how they learn more about the culture of the organization.

04.51: I would differentiate between the idea of toil and paying your dues within the organization. In investment banking, there’s a whole concern about that: “They just need to sit in the office for 12 hours a day to really get the culture here.” And I would differentiate that from. . .

05.04: Or “Get these slides into pitch decks and make sure all the fonts are the right fonts.”

05.11: That’s right. Yeah, I worked at Facebook Reality Labs, and there were many times where we would do a Zuck review, and getting those slides perfect was a huge task for the team. What I would say is I want to differentiate this from the gaining of expertise. So if we think about Gary Klein, naturalistic decision making, real expertise is actually about being able to see an environment. And that could be a data environment [or] information environment as well. And then as you gain expertise, you’re able to discern between important signals and noise. And so what I’m not advocating for is to remove the ability to gain that expertise. But I am saying that toilsome work doesn’t necessarily contribute to expertise. 

05.49: In the case of status reporting as an example—status reporting is very valuable for a person to be able to understand what is going on with the team, and then, “What actions do I need to take?” And we don’t want to remove that. But the idea that a TPM or product manager or EM has to dig through all of the different issues that are inside of a particular repo to look for specific updates and then do their own synthesis of a draft, I think there is a difference there. And so what I would say is that the idea of me reading this information in a way that is very convenient for me to consume and then to be able to shape the signal that I then put out into the organization as a status report, that is still very much a human decision.

06.30: And I think that’s where we can start to use tools. Ethan Mollick has talked about this a lot in the way that he’s trying to approach including LLMs in, say, the classroom. There’s two patterns that I think could come out of this. One is that when I have some type of early draft of something, I should be able to get a lot of early feedback that is very low reputational risk. And what I mean by that is that a bot can tell me “Hey, this is not written in a way with the active voice” or “[This] is not really talking about the impact of this on the organization.” And so I can get that super early feedback in a way that is not going to hurt me.

If I publish a really bad status report, people may think less of me inside the organization. But using a bot or an agent or just a prompt to even just say, “Hey, these are the ways you can improve this”—that type of early feedback is really, really valuable. That I have a draft and I get critique from a bunch of different viewpoints I think is super valuable and will build expertise.

07.24: And then there’s the other side, which is, when we talk about consuming lots of information and then synthesizing or translating it into a draft, I can then critique “Is this actually valuable to the way that I think that this leader thinks? Or what I’m trying to convey as an impact?” And so then I am critiquing the straw man that is output by these prompts and agents.

07.46: Those two different patterns together actually create a really great loop for me to be able to learn not only from agents but also from the standpoint of seeing how. . . The part that ends up being really exciting is that once you start to connect the way communication happens inside the organization, I can then see what my leaders passed on to the next leader or how this person interpreted it. And I can use that as a feedback loop to then improve, over time, my expertise in, say, writing a status report that is shaped for the leader. There’s also a whole thing that when we talk about status reporting in particular, there is a difference in expertise that people are getting that I’m not always 100%. . .

08.21: It’s valuable for me to understand how my leader thinks and makes decisions. I think that is very valuable. But the idea that I will spend hours and hours shaping and formulating a status report from my point of view for someone else can be aided by these types of systems. And so status should not be at the speaker’s mouth; it should be at the listener’s ear.

For these leaders, they want to be able to understand “Are the teams making the right decisions? Do I trust them? And then where should I preemptively intervene because of my experience or maybe my understanding of the context in the broader organization?” And so that’s what I would say: These tools are very valuable in helping build that expertise.

09.00: It’s just that we have to rethink “What is expertise?” And I just don’t buy it that paying your dues is the way you gain expertise. You do sometimes. Absolutely. But a lot of it is also just busy work and toil. 

09.11: My thing is these are productivity tools. And so you make even your junior employees productive—you just change the way you use your more-junior employees. 

09.24: Maybe just one thing to add to this is that there is something really interesting inside of the education world of using LLMs: trying to understand where someone is at. And so the type of feedback for someone that is very early in their career, or doing something for the first time, is potentially very different in the way that you’re teaching them or giving them feedback, versus someone that is much further along in expertise: they want to be able to just get down to “What are some things I’m missing here? Where am I biased?” Those are things where I think we also need to do a better job for those early employees, the people that are just starting to get expertise—“How do we train them using these tools as well as other ways?”

10.01: And I’ve done that as well. I do a lot of learning and development help, internal to companies, and I did that as part of the PM faculty for learning and development at Google. And so thinking a lot about how PMs gain expertise, I think we’re doing a real disservice by making it so that product manager, as a junior position, is so hard to get.

10.18: I think it’s really bad because, right out of college, I started doing program management, and it taught me so much about this. But at Microsoft, when I joined, we would say that the program manager wasn’t really worth very much for the first two years, right? Because they’re gaining expertise in this.

And so I think LLMs can give people the ability to gain expertise faster and also help them avoid making errors that other people might make. But I think there’s a lot to do with just learning and development in general that we need to pair with LLMs and human systems.

10.52: In terms of agents, I guess agents for product management, first of all, do they exist? And if they do, I always like to look at what level of autonomy they really have. Most agents really are still partially autonomous, right? There’s still a human in the loop. And so the question is “How much is the human in the loop?” It’s kind of like a self-driving car. There’s driver assists, and then there’s all the way to self-driving. A lot of the agents right now are “driver assist.” 

11.28: I think you’re right. That’s why I don’t always use the term “agent,” because it’s not an autonomous system that is storing memory using tools, constantly operating.

I would argue though that there is no such thing as “human out of the loop.” We’re probably just drawing the system diagram wrong if we’re saying that there’s no human that’s involved in some way. That’s the first thing. 

11.53: The second thing I’d say is that I think you’re right. A lot of the time right now, it ends up being that when the human needs the help, we end up creating systems inside of GitHub. We have something that’s called Copilot Spaces, which is really like a custom GPT. It’s really just a bundling of context that I can then go to when I need help with a particular type of thing. We built very highly specific types of Copilot Spaces, like “I need to write a blog announcement about something. And so what’s the GitHub writing style? How should I be wording this, avoiding jargon?” Internal things like that. So it can be highly specific. 

We also have more general tools that are kind of like “How do I form and maintain initiatives throughout the entire software development lifecycle? When do I need certain types of feedback? When do I need to generate the 12 to 14 different documents that compliance and downstream teams need?” And so those tend to be operating in the background to autodraft these things based on the context that’s available. And so I’d say that’s semiagentic, to a certain extent. 

12.52: But I think actually there’s really big opportunities when it comes to. . . One of the cases that we’re working on right now is actually linking information in the GitHub graph that is not commonly linked. And so a key example of that might be kicking off all of the process that goes along with doing a release. 

When I first get started, I actually want to know in our customer feedback repo, in all the different places where we store customer feedback, “Where are there times that customers actually asked about this or complained about it or had some information about this?” And so when I get started, being able to automatically link something like a release tracking issue with all of this customer feedback becomes really valuable. But it’s very hard for me as an individual to do that. And what we really want—and what we’re building—[are] things that are more and more autonomous about constantly searching for feedback or information that we can then connect to this release tracking issue.

13.44: So that’s why I say we’re starting to get into the autonomous realm when it comes to this idea of something going around looking for linkages that don’t exist today. And so that’s one of those things, because again, we’re talking about information flow. And a lot of the time, especially in organizations the size of GitHub, there’s lots of siloing that takes place.

We have lots of repos. We have lots of information. And so it’s really hard for a single person to ever keep all of that in their head and to know where to go, and so [we’re] bringing all of that into the tools that they end up using. 

14.14: So for example, we’ve also created internal things—these are more assist-type use cases—but the idea of a Gemini Gem inside of a Google doc or an M365 agent inside of Word that is then also connected to the GitHub graph in some way. I think the question is “When do we expose this information? Is it always happening in the background, or is it only when I’m drafting the next version of this initiative?” That ends up becoming really, really important.

14.41: Some of the work we’ve been experimenting with is actually “How do we start to include agents inside of the synchronous meetings that we actually do?” You probably don’t want an agent to suddenly start speaking, especially because there’s lots of different agents that you may want to have in a meeting.

We don’t have a designer on our team, so I actually end up using an agent that is prompted to be like a designer and think like a designer inside of these meetings. And so we probably don’t want them to speak up dynamically inside the meeting, but we do want them to add information if it’s helpful. 

We want to autoprototype things as a straw man for us to be able to react to. We want to start to use our planning agents and stuff like that to help us plan out “What is the work that might need to take place?” It’s a lot of experimentation about “How do we actually pull things into the places that humans are doing the work?”—which is usually synchronous meetings, some types of asynchronous communication like Teams or Slack, things like that.

15.32: So that’s where I’d say the full possibility [is] for, say, a PM. And our customers are also TPMs and leaders and people like that. It really has to do with “How are we linking synchronous and asynchronous conversations with all of this information that is out there in the ecosystem of our organization that we don’t know about yet, or viewpoints that we don’t have that we need to have in this conversation?”

15.55: You mentioned the notion of a design agent passively in the background, attending a meeting. This is fascinating. So this design agent, what is it? Is it a fine-tuned agent or. . .? What exactly makes it a design agent? 

16.13: In this particular case, it’s a specific prompt that defines what a designer would usually do in a cross-functional team and what they might ask questions about, what they would want clarification of. . .

16.26: Completely reliant on the pretrained foundation model—no posttraining, no RAG, nothing? 

16.32: No, no. [Everything is in the prompt] at this point. 

16.36: How big is this prompt? 

16.37: It’s not that big. I’d say it’s maybe at most 50 lines, something like that. It’s pretty small. The truth is, the idea of a designer is something that LLMs know about. But more for our specific case, right now it’s really just based on this live conversation. And there’s a lot of papercuts in the way that we have to do a site call, pull a live transcript, put it into a space, and [then] I have a bunch of different agents that are inside the space that will then pipe up when they have something interesting to say, essentially.

And it’s a little weird because I have to share my screen and people have to read it, hold the meeting. So it’s clunky right now in the way that we bring this in. But what it will bring up is “Hey, these are patterns inside of design that you may want to think about.” Or you know, “For this particular part of the experience, it’s still pretty ambiguous. Do you want to define more about what this part of the process is?” And we’ve also included legal, privacy, data-oriented groups. Even the idea of a facilitator agent saying that we were getting off track or we have these other things to discuss, that type of stuff. So again, these are really rudimentary right now.
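
For flavor, here is a minimal sketch of that persona-prompt pattern; the prompt text and the chat callable below are hypothetical illustrations, not GitHub’s actual implementation or prompt.

```python
# Minimal sketch: a short "designer" persona prompt that only pipes up
# when it has something useful to add to the live transcript.
from typing import Callable, Optional

DESIGNER_PERSONA = """You are acting as the product designer in this meeting.
Watch the running transcript. Only respond when you can add value, for example:
- an established design pattern that fits what is being discussed
- a part of the user experience that is still ambiguous and needs definition
- an accessibility or consistency concern with the design system
If you have nothing useful to add, reply exactly: PASS."""


def designer_comment(transcript_so_far: str, chat: Callable[..., str]) -> Optional[str]:
    """Ask the persona for a comment; suppress it when the model passes."""
    reply = chat(system=DESIGNER_PERSONA, user=transcript_so_far)
    return None if reply.strip() == "PASS" else reply


# Hypothetical usage with whatever chat client and meeting tooling you have:
# comment = designer_comment(live_transcript, chat=my_chat_call)
# if comment:
#     post_to_meeting_notes(comment)
```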

17.37: Now, what I could imagine though is, we have a design system inside of GitHub. How might we start to use that design system and use internal prototyping tools to autogenerate possibilities for what we’re talking about? And I guess when I think about using prototyping as a PM, I don’t think the PMs should be vibe coding everything.

I don’t think the prototype replaces a lot of the cross-functional documents that we have today. But I think what it does increase is that if we have been talking about a feature for about 30 minutes, that is a lot of interesting context that if we can say, “Autogenerate three different prototypes that are coming from slightly different directions, slightly different places that we might integrate inside of our current product,” I think what it does is it gives us, again, that straw man for us to be able to critique, which will then uncover additional assumptions, additional values, additional principles that we maybe haven’t written down somewhere else.

18.32: And so I see that as super valuable. And that’s the thing that we end up doing—we’ll use an internal product for prototyping to just take that and then have it autogenerated. It takes a little while right now, you know, a couple minutes to do a prototype generation. And so in those cases we’ll just [say], “Here’s what we thought about so far. Just give us a prototype.” And again it doesn’t always do the right thing, but at least it gives us something to now talk about because it’s more real now. It is not the thing that we end up implementing, but it is the thing that we end up talking about. 

18.59: By the way, this notion of an agent attending some synchronous meeting, you can imagine taking it to the next level, which is to take advantage of multimodal models. The agent can then absorb speech and maybe visual cues, so then basically when the agent suggests something and someone reacts with a frown. . . 

19.25: I think there’s something really interesting about that. And when you talk about multimodal, I do think that one of the things that is really important about human communication is the way that we pick up cues from each other—if we think about it, the reason why we actually talk to each other. . . And there’s a great book called The Enigma of Reason that’s all about this.

But their hypothesis is that, yes, we can try to logic or pretend to logic inside of our own heads, but we actually do a lot of post hoc analysis. So we come up with an idea inside our head. We have some certainty around it, some intuition, and then we fit it to why we thought about this. So that’s what we do internally. 

But when you and I are talking, I’m actually trying to read your mind in some way. I’m trying to understand the norms that are at play. And I’m using your facial expression. I’m using your tone of voice. I’m using what you’re saying—actually way less of what you’re saying and more your facial expression and your tone of voice—to determine what’s going on.

20.16: And so I think this idea of engagement with these tools and the way these tools work, I think [of] the idea of gaze tracking: What are people looking at? What are people talking about? How are people reacting to this? And then I think this is where in the future, in some of the early prototypes we built internally for what the synchronous meeting would look like, we have it where the agent is raising its hand and saying, “Here’s an issue that we may want to discuss.” If the people want to discuss it, they can discuss it, or they can ignore it. 

20.41: Longer term, we have to start to think about how agents are fitting into the turn-taking of conversation with the rest of the group. And using all of these multimodal cues ends up being very interesting, because you wouldn’t want just an agent whenever it thinks of something to just blurt it out.

20.59: And so there’s a lot of work to do here, but I think there’s something really exciting about just using engagement as the means to understand what the hot topics are, but also trying to help detect “Are we rat-holing on something that should be put in the parking lot?” Those are things and cues that we can start to get from these systems as well.

21.16: By the way, context has multiple dimensions. So you can imagine in a meeting between the two of us, you outrank me. You’re my manager. But then it turns out the agent realizes, “Well, actually, looking through the data in the company, Ben knows more about this topic than Chris. So maybe when I start absorbing their input, I should weigh Ben’s, even though in the org chart Chris outranks Ben.” 

21.46: A related story: one of the things I’ve created inside of a Copilot Space is actually a proxy for our CPO. And so what I’ve done is I’ve taken meetings that he’s done where he asked questions in a smaller setting, taking his writing samples and things like that, and I’ve tried to turn it into, not really an agent, but a space where I can say, “Here’s what I’m thinking about for this plan. And what would Mario [Rodriguez] potentially think about this?” 

It’s definitely not 100% accurate in any way. Mario’s an individual that is constantly changing and is learning and has intuitions that he doesn’t say out loud, but it is interesting how it does sound like him. It does seem to focus on questions that he would bring up in a previous meeting based on the context that we provided. And so I think, to your point, there are a lot of things said inside of meetings right now that we then don’t use to actually help understand people’s points of view in a deeper way.

22.40: You could imagine that this proxy also could be used for [determining] potential blind spots for Mario that, as a person that is working on this, I may need to deal with, in the sense that maybe he’s not always focused on this type of issue, but I think it’s a really big deal. So how do I help him actually understand what’s going on?

22.57: And this gets back to that idea of reporting at the listener’s ear: What does that person actually care about? What do they need to know about to build trust with the team? What do they need to take action on? Those are areas where I think we can start to build interesting profiles. 

There’s a really interesting ethical question, which is: Should that person be able to write their own proxy? Would it include the blind spots that they have or not? And then maybe compare this to—you know, there’s [been] a trend for a little while where every leader would write their own user manual or readme, and inside of those things, they tend to be a bit more performative. It’s more about how they idealize their behavior versus the way that they actually are.

23.37: And so there’s some interesting problems that start to come up when we’re doing proxying. I don’t call it a digital twin of a person, because digital twins to me are basically simulations of mechanical things. But to me it’s “What is this proxy that might sit in this meeting to help give us a perspective and maybe even identify when this is something we should escalate to that person?”

23.55: I think there are lots of very interesting things here. Power structures inside of the organization are really hard to discern because there are both, to your point, the hierarchical ones that are very set in the systems that are there, and also the unsaid ones. 

I mean, one funny story is Ray Dalio did try to implement this inside of his hedge fund. And unfortunately, I guess, for him, there were two people that were considered to be higher ranking in reputation than him. But then he changed the system so that he was ranked number one. So I guess we have to worry about this type of thing for these proxies as well. 

24.27: One of the reasons why coding is such a great playground for these things is, one, you can validate the result. But secondly, the data is quite tame and relatively structured. So you have version control systems like GitHub—you can look through that and say, “Hey, actually Ben’s commits are much more valuable than Chris’s commits.” Or “Ben is the one who suggested all of these changes before, and they were all accepted. So maybe we should really take Ben’s opinion much more strong[ly].” I don’t know what artifacts you have in the product management space that can help develop this reputation score.

25.09: Yeah. It’s tough because a reputation score, especially once you start to monitor some type of metric and it becomes the goal, that’s where we get into problems. For example, Agile teams adopting velocity as a metric: It’s meant to be an internal metric that helps us understand “If this person is out, how does that adjust what type of work we need to do?” But then comparing velocities between different teams ends up creating a whole can of worms around “Is this actually the metric that we’re trying to optimize for?”

25.37: And even when it comes to product management, what I would say is actually valuable a lot of the time is “Does the team understand why they’re working on something? How does it link to the broader strategy? How does this solve both business and customer needs? And then how are we wrangling this uncertainty of the world?” 

I would argue that a really key meta skill for product managers—and for other people like generative user researchers, business development people, you know, even leaders inside the organization—is dealing with a lot of uncertainty. And it’s not that we need to shut down the uncertainty, because uncertainty is actually something we should use to our advantage. But there are places where we need to be able to build enough certainty for the team to do their work and then make plans that are resilient to future uncertainty. 

26.24: And then finally, the ability to communicate what the team is doing and why it’s important is very valuable. Unfortunately, there’s not a lot of. . . Maybe there’s rubrics we can build. And that’s actually what career ladders try to do for product managers. But they tend to be very vague actually. And as you get more senior inside of a product manager organization, you start to see things—it’s really just broader views, more complexity. That’s really what we start to judge product managers on. Because of that fact, it’s really about “How are you working across the team?”

26.55: There will be cases, though, that we can start to say, “Is this thing thought out well enough at first, at least for the team to be able to take action?” And then linking that work as a team to outcomes ends up being something that we can apply more and more data rigor to. But I worry about it being “This initiative brief was perfect, and so that meant the success of the product,” when the reality was that was maybe the starting point, but there was all this other stuff that the product manager and the team was doing together. So I’m always wary of that. And that’s where performance management for PMs is actually pretty hard: where you have to base most of your understanding on how they work with the other teammates inside their team.

27.35: You’ve been in product for a long time, so you have a network of peers in other companies, right? What are one or two examples of the use of AI—not in GitHub—in the product management context that you admire? 

27.53: For a lot of the people that I know that are inside of startups that are basically using prototyping tools to build out their initial product, I have a lot of, not necessarily envy, but I respect that a lot because you have to be so scrappy inside of a startup, and you’re really there to not only prove something to a customer, or actually not even prove something, but get validation from customers that you’re building the right thing. And so I think that type of rapid prototyping is something that is super valuable for that stage of an organization.

28.26: When I start to then look at larger enterprises, what I do see is that these prototyping tools aren’t as much help with what we’ll call brownfield development: We need to build something on top of this other thing. It’s actually hard to use these tools today to imagine new things inside of a current ecosystem or a current design system.

28.46: [For] a lot of the teams that are in other places, it really is a struggle to get access to some of these tools. The thing that’s holding back the biggest enterprises from actually doing interesting work in this area is they’re overconstraining what their engineers [and] product managers can use as far as these tools.

And so what’s actually being created is shadow systems, where the person is using their personal ChatGPT to actually do the work rather than something that’s within the compliance of the organization.

29:18: Which is great for IP protection. 

29:19: Exactly! That’s the problem, right? Some of this stuff, you do want to use the most current tools. Because there is actually not just [the] time savings aspect and toil reduction aspects—there’s also just the fact that it helps you think differently, especially if you’re an expert in your domain. It really aids you in becoming even better at what you’re doing. And then it also shores up some of your weaknesses. Those are the things that really expert people are using these types of tools for. But in the end, it comes down to a combination of legal, HR, and IT, and budgetary types of things too, that are holding back some of these organizations.

30.00: When I’m talking to other people inside of the orgs. . . Maybe another problem for enterprises right now is that a lot of these tools require lots of different context. We’ve benefited inside of GitHub in that a lot of our context is inside the GitHub graph, so Copilot can access it and use it. But other teams keep things in all of these individual vendor platforms.

And so the biggest problem then ends up being “How do we merge these different pieces of context in a way that is allowed?” When I first started working in the team of Synapse, I looked at the patterns that we were building and it was like “If we just had access to Zapier or Relay or something like that, that is exactly what we need right now.” Except we would not have any of the approvals for the connectors to all of these different systems. And so Airtable is a great example of something like that too: They’re building out process automation platforms that focus on data as well as connecting to other data sources, plus the idea of including LLMs as components inside these processes.

30.58: A really big issue I see for enterprises in general is the connectivity issue between all the datasets. And there are, of course, teams that are working on this—Glean or others that are trying to be more of an overall data copilot frontend for your entire enterprise datasets. But I just haven’t seen as much success in getting all these connected. 

31.17: I think one of the things that people don’t realize is enterprise search is not turnkey. You have to get in there and really do all these integrations. There’s no shortcuts. There’s no case where a vendor comes to you and says, “Yeah, just use our system,” and it all magically works.

31.37: This is why we need to hire more people with degrees in library science, because they actually know how to manage these types of systems. Again, my first cutting my teeth on this was in very early versions of SharePoint a long time ago. And even inside there, there’s so much that you need to do to just help people with not only organization of the data but even just the search itself.

It’s not just a search index problem. It’s a bunch of different things. And that’s why, whenever we’re shown an empty text box, there’s so much work that goes on just behind that; inside of Google, with all of the instant answers, there are lots of different ways that a particular search query is actually looked at, not just to go against the search index but to also just provide you the right information. And now they’re trying to include Gemini by default in there. The same thing happens within any copilot. There’s a million different things you could use. 

32.27: And so I guess maybe this gets to my hypothesis about the way that agents will be valuable, either fully autonomous ones or ones that are attached to a particular process: having many different agents that are highly biased in a particular way. And I use the term bias as in bias can be good, neutral, and bad, right? I don’t mean bias in a way of unfairness and that type of stuff; I mean more from the standpoint of “This agent is meant to represent this viewpoint, and it’s going to give you feedback from this viewpoint.” That ends up becoming really, really valuable because of the fact that you will not always be thinking about everything. 

33.00: I’ve done a lot of work in adversarial thinking and red teaming and stuff like that. One of the things that is most valuable is to build prompts that are breaking the sycophancy of these different models that are there by default, because it should be about challenging my thinking rather than just agreeing with it.

And then the standpoint of each one of these highly biased agents actually helps provide a very interesting approach. I mean, if we go to things like meeting facilitation or workshop facilitation groups, this is why. . . I don’t know if you’re familiar with the six hats, but the six hats is a technique by which we declare inside of a meeting that I’m going to be the one that’s all positivity. This person’s going to be the one about data. This person’s gonna be the one that’s the adversarial, negative one, etc., etc. When you have all of these different viewpoints, because of the tensions in the discussion of those ideas, the creation of options, the weighing of options, I think you end up making much better decisions. That’s where I think those highly biased viewpoints end up becoming really valuable. 
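
For readers who want to experiment with this idea, here’s a minimal sketch of the “six hats”-style setup Chris describes: several deliberately biased personas critiquing the same proposal. The persona wording, model name, and OpenAI client usage are illustrative assumptions, not anything from GitHub’s actual systems.

```python
# Minimal sketch: "six hats"-style critique from several deliberately biased personas.
# Assumes the OpenAI Python client; persona texts and model name are illustrative only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PERSONAS = {
    "optimist": "You only look for upside and opportunities in the proposal.",
    "data_skeptic": "You only question the evidence: metrics, sample sizes, missing baselines.",
    "adversary": "You look for failure modes, risks, and reasons this will not work.",
}

def critique(proposal: str) -> dict[str, str]:
    """Collect one critique per persona for the same proposal."""
    feedback = {}
    for name, stance in PERSONAS.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[
                {"role": "system", "content": f"{stance} Do not agree just to be agreeable."},
                {"role": "user", "content": proposal},
            ],
        )
        feedback[name] = response.choices[0].message.content
    return feedback

if __name__ == "__main__":
    proposal = "We should ship the redesigned onboarding flow next sprint."
    for persona, note in critique(proposal).items():
        print(f"--- {persona} ---\n{note}\n")
```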

34.00: For product people who are early in their career or want to enter the field, what are some resources that they should be looking at in terms of leveling up on the use of AI in this context?

34.17: The first thing is there are millions of prompt libraries out there for product managers. What you should do is when you are creating work, you should be using a lot of these prompts to give you feedback, and you can actually even write your own, if you want to. But I would say there’s lots of material out there for “I need to write this thing.”

One pattern is “I try to write it and then I get critique.” The other is having the AI system, through a prompt, generate a draft of the thing, and then I go in and look at it and say, “Which things are not actually quite right here?” And I think that, again, those two patterns of getting critique and giving critique end up building a lot of expertise.

34.55: I think also within the organization itself, I believe an awful lot in things that are called basically “learning from your peers.” Being able to join small groups where you are getting feedback from your peers and including AI agent feedback inside of the small peer groups is very valuable. 

There’s another technique, which is using case studies. As part of my learning development practice, I actually do something called “decision-forcing cases,” where we take a story that actually happened, we walk people through it, and we ask them, “What do you think is happening? What would you do next?” When you do those types of things across junior and senior people, the junior people can start to actually learn the expertise from the senior people through these types of case studies.

35.37: I think there’s an awful lot more that senior leaders inside the organization should be doing. And as junior people inside your organization, you should be going to these senior leaders and saying, “How do you think about this? What is the way that you make these decisions?” Because what you’re actually pulling from is their past experience and expertise that they’ve gained to build that intuition.

35.53: There’s all sorts of surveys of programmers and engineers and AI. Are there surveys about product managers? Are they freaked out or what? What’s the state of adoption and this kind of thing? 

36.00: Almost every PM that I’ve met has used an LLM in some way, to help them with their writing in particular. And if you look at the studies by OpenAI about the use of ChatGPT, a lot of the writing tasks end up coming from a product manager or senior leader standpoint. I think people are freaked out because every practice says that some other practice is going to be replaced, because “I can in some way replace them right now with a viewpoint.”

36.38: I don’t think product management will go away. We may change the terminology that we end up using. But this idea of someone that is helping manage the complexity of the team, help with communication, help with [the] decision-making process inside that team is still very valuable and will be valuable even when we can start to autodraft a PRD.

I would argue that the draft of the PRD is not what matters. It’s actually the discussions that take place in the team after the PRD is created. And I don’t think that designers are going to take over the PM work because, yes, it is about to a certain extent the interaction patterns and the usability of things and the design and the feeling of things. But there’s all these other things that you need to worry about when it comes to matching it to business models, matching it to customer mindsets, deciding which problems to solve. They’re doing that. 

37.27: There’s a lot of this concern about [how] every practice is saying this other practice is going to go away because of AI. I just don’t think that’s true. I just think we’re all going to be given different levels of abstraction to gain expertise on. But the core of what we do—an engineer focusing on what is maintainable and buildable and actually something that we want to work on versus the designer that’s building something usable and something that people will feel good using, and a product manager making sure that we’re actually building the thing that is best for the company and the user—those are things that will continue to exist even with these AI tools, prototyping tools, etc.

38.01: And for our listeners, as Chris mentioned, there’s many, many prompt templates for product managers. We’ll try to get Chris to recommend one, and we’ll put it in the episode notes. [See “Resources from Chris” below.] And with that thank you, Chris. 

38.18: Thank you very much. Great to be here.

Resources from Chris

Here’s what Chris shared with us following the recording:

There are two [prompt resources for product managers] that I think people should check out:

However, I’d say that people should take these as a starting point and they should adapt them for their own needs. There is always going to be nuance for their roles, so they should look at how people do the prompting and modify for their own use. I tend to look at other people’s prompts and then write my own.

If they are thinking about using prompts frequently, I’d make a plug for Copilot Spaces to pull that context together.

Generative AI in the Real World: Context Engineering with Drew Breunig

16 October 2025 at 07:18

In this episode, Ben Lorica and Drew Breunig, a strategist at the Overture Maps Foundation, talk all things context engineering: what’s working, where things are breaking down, and what comes next. Listen in to hear why huge context windows aren’t solving the problems we hoped they might, why companies shouldn’t discount evals and testing, and why we’re doing the field a disservice by leaning into marketing and buzzwords rather than trying to leverage what the current crop of LLMs is actually capable of.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: All right. So today we have Drew Breunig. He is a strategist at the Overture Maps Foundation. And he’s also in the process of writing a book for O’Reilly called the Context Engineering Handbook. And with that, Drew, welcome to the podcast.

00.23: Thanks, Ben. Thanks for having me on here. 

00.26: So context engineering. . . I remember before ChatGPT was even released, someone was talking to me about prompt engineering. I said, “What’s that?” And then of course, fast-forward to today, now people are talking about context engineering. And I guess the short definition is it’s the delicate art and science of filling the context window with just the right information. What’s broken with how teams think about context today? 

00.56: I think it’s important to talk about why we need a new word or why a new word makes sense. I was just talking with Mike Taylor, who wrote the prompt engineering book for O’Reilly, exactly about this and why we need a new word. Why is prompt engineering not good enough? And I think it has to do with the way the models, and the way they’re being built, are evolving. I think it also has to do with the way that we’re learning how to use these models. 

And so prompt engineering was a natural word to think about when your interaction and how you program the model was maybe one turn of conversation, maybe two, and you might pull in some context to give it examples. You might do some RAG and context augmentation, but you’re working with this one-shot service. And that was really similar to the way people were working in chatbots. And so prompt engineering started to evolve as this thing. 

02.00: But as we started to build agents and as companies started to develop models that were capable of multiturn tool-augmented reasoning usage, suddenly you’re not using that one prompt. You have a context that is sometimes being prompted by you, sometimes being modified by your software harness around the model, sometimes being modified by the model itself. And increasingly the model is starting to manage that context. And that prompt is very user-centric. It is a user giving that prompt. 

But when we start to have this multiturn, systematic editing and preparation of contexts, a new word was needed, which is this idea of context engineering. This is not to belittle prompt engineering. I think it’s an evolution. And it shows how we’re evolving and finding this space in real time. I think context engineering is more suited to agents and applied AI programming, whereas prompt engineering lives in how people use chatbots, which is a different field. It’s not better and not worse. 

And so context engineering is more specific to understanding the failure modes that occur, diagnosing those failure modes and establishing good practices for both preparing your context but also setting up systems that fix and edit your context, if that makes sense. 

03.33: Yeah, and also, it seems like the words themselves are indicative of the scope, right? So “prompt” engineering means it’s the prompt. So you’re fiddling with the prompt. And [with] context engineering, “context” can be a lot of things. It could be the information you retrieve. It might involve RAG, so you retrieve information. You put that in the context window. 

04.02: Yeah. And people were doing that with prompts too. But I think in the beginning we just didn’t have the words. And that word became a big empty bucket that we filled up. You know, the quote I always quote too often, but I find it fitting, is one of my favorite quotes from Stewart Brand, which is, “If you want to know where the future is being made, follow where the lawyers are congregating and the language is being invented,” and the arrival of context engineering as a word came after the field was invented. It just kind of crystallized and demarcated what people were already doing. 

04.36: So the word “context” means you’re providing context. So context could be a tool, right? It could be memory. Whereas the word “prompt” is much more specific. 

04.55: And I think it also is like, it has to be edited by a person. I’m a big advocate for not using anthropomorphizing words around large language models. “Prompt” to me involves agency. And so I think it’s nice—it’s a good delineation. 

05.14: And then I think one of the very immediate lessons that people realize is, just because. . . 

So one of the things that these model providers note when they have a model release is: What’s the size of the context window? So people started associating the context window [with] “I stuff as much as I can in there.” But the reality is actually that, one, it’s not efficient. And two, it also is not useful to the model. Just because you have a massive context window doesn’t mean that the model treats the entire context window evenly.

05.57: Yeah, it doesn’t treat it evenly. And it’s not a one-size-fits-all solution. So I don’t know if you remember last year, but that was the big dream, which was, “Hey, we’re doing all this work with RAG and augmenting our context. But wait a second, if we can make the context 1 million tokens, 2 million tokens, I don’t have to run RAG on all of my corporate documents. I can just fit it all in there, and I can constantly be asking this. And if we can do this, we essentially have solved all of the hard problems that we were worrying about last year.” And so that was the big hope. 

And you started to see an arms race of everybody trying to make bigger and bigger context windows to the point where, you know, Llama 4 had its spectacular flameout. It was rushed out the door. But the headline feature by far was “We will be releasing a 10 million token context window.” And the thing that everybody realized is. . .  Like, all right, we were really hopeful for that. And then as we started building with these context windows, we started to realize there were some big limitations around them.

07.01: Perhaps the thing that clicked for me was in Google’s Gemini 2.5 paper. Fantastic paper. And one of the reasons I love it is because they dedicate about four pages in the appendix to talking about the kind of methodology and harnesses they built so that they could teach Gemini to play Pokémon: how to connect it to the game, how to actually read out the state of the game, how to make choices about it, what tools they gave it, all of these other things.

And buried in there was a real “warts and all” case study, which are my favorite: when you talk about the hard things and especially when you cite the things you can’t overcome. Gemini 2.5 had a million-token context window with, eventually, 2 million tokens coming. But in this Pokémon thing, they said, “Hey, we actually noticed something, which is once you get to about 200,000 tokens, things start to fall apart, and they fall apart for a host of reasons.” They start to hallucinate. One of the things that is really demonstrable is they start to rely more on the context knowledge than the weights knowledge. 

08.22: So inside every model there’s a knowledge base. There’s, you know, all of these other things that get kind of buried into the parameters. But when you reach a certain level of context, it starts to overload the model, and it starts to rely more on the examples in the context. And so this means that you are not taking advantage of the full strength or knowledge of the model. 

08.43: So that’s one way it can fail. We call this “context distraction,” though Kelly Hong at Chroma has written an incredible paper documenting this, which she calls “context rot,” which is a similar way [of] charting when these benchmarks start to fall apart.

Now the cool thing about this is that you can actually use this to your advantage. There’s another paper out of, I believe, the Harvard Interaction Lab, where they look at these inflection points for. . . 

09.13: Are you familiar with the term “in-context learning”? In-context learning is when you teach the model to do something it doesn’t know how to do by providing examples in your context. And those examples illustrate how it should perform. It’s not something that it’s seen before. It’s not in the weights. It’s a completely unique problem. 

Well, sometimes those in-context learning[s] are counter to what the model has learned in the weights. So they end up fighting each other, the weights and the context. And this paper documented that when you get over a certain context length, you can overwhelm the weights and you can force it to listen to your in-context examples.

09.57: And so all of this is just to try to illustrate the complexity of what’s going on here and how I think one of the traps that leads us to this place is that the gift and the curse of LLMs is that we prompt and build contexts that are in the English language or whatever language you speak. And so that leads us to believe that they’re going to react like other people or entities that read the English language.

And the fact of the matter is, they don’t—they’re reading it in a very specific way. And that specific way can vary from model to model. And so you have to systematically approach this to understand these nuances, which is where the context management field comes in. 

10.35: This is interesting because even before those papers came out, there were studies which showed the exact opposite problem, which is the following: You may have a RAG system that actually retrieves the right information, but then somehow the LLMs can still fail because, as you alluded to, they have weights, so they have prior beliefs. They saw something [on] the internet, and they will opine against the precise information you retrieved into the context. 

11.08: This is a really big problem. 

11.09: So this is true even if the context window’s small actually. 

11.13: Yeah, and Ben, you touched on something that’s really important. So in my original blog post, I document four ways that context fails. I talk about “context poisoning.” That’s when you hallucinate something in a long-running task and it stays in there, and so it’s continually confusing the model. “Context distraction,” which is when you overwhelm that soft limit to the context window and then you start to perform poorly. “Context confusion”: This is when you put things that aren’t relevant to the task inside your context, and suddenly the model thinks that it has to pay attention to this stuff and it leads it astray. And then the last thing is “context clash,” which is when there’s information in the context that’s at odds with the task that you are trying to perform. 

A good example of this is, say you’re asking the model to only reply in JSON, but you’re using MCP tools that are defined with XML. And so you’re creating this backwards thing. But I think there’s a fifth piece that I need to write about because it keeps coming up. And it’s exactly what you described.

12.23: Douwe [Kiela] over at Contextual AI refers to this as “context” or “prompt adherence.” But the term that keeps sticking in my mind is this idea of fighting the weights. There’s three situations you get yourself into when you’re interacting with an LLM. The first is when you’re working with the weights. You’re asking it a question that it knows how to answer. It’s seen many examples of that answer. It has it in its knowledge base. It comes back with the weights, and it can give you a phenomenal, detailed answer to that question. That’s what I call “working with the weights.” 

The second is what we referred to earlier, which is that in-context learning, which is you’re doing something that it doesn’t know about and you’re showing an example, and then it does it. And this is great. It’s wonderful. We do it all the time. 

But then there’s a third example, which is you’re providing it examples, but those examples are at odds with some things that it learned, usually during posttraining, during the fine-tuning or RL stage. A really good example is output formats. 

13.34: Recently a friend of mine was updating his pipeline to try out a new model, Moonshot’s. A really great model, and a really great model for tool use. And so he just changed his model and hit run to see what happened. And he kept failing—his thing couldn’t even work. He’s like, “I don’t understand. This is supposed to be the best tool use model there is.” And he asked me to look at his code.

I looked at his code and he was extracting data using Markdown, essentially: “Put the final answer in an ASCII box and I’ll extract it that way.” And I said, “If you change this to XML, see what happens. Ask it to respond in XML, use XML as your formatting, and see what happens.” He did that. That one change passed every test. Like basically crushed it because it was working with the weights. He wasn’t fighting the weights. Everyone’s experienced this if you build with AI: the stubborn things it refuses to do, no matter how many times you ask it, including formatting. 
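
As a rough illustration of that formatting change, here’s a minimal sketch of asking for the final answer in XML-style tags and extracting it, rather than parsing a Markdown block. The tag name, prompt wording, and stand-in completion are assumptions for illustration, not what Drew’s friend actually used.

```python
# Minimal sketch: ask for the final answer in XML-style tags and extract it with a regex,
# instead of parsing a Markdown/ASCII block. Tag name and prompt wording are assumptions.
import re
from typing import Optional

PROMPT_TEMPLATE = (
    "Answer the question below. "
    "Wrap ONLY your final answer in <answer>...</answer> tags.\n\n"
    "Question: {question}"
)

def extract_answer(completion_text: str) -> Optional[str]:
    """Pull the final answer out of <answer> tags; None means the model ignored the format."""
    match = re.search(r"<answer>(.*?)</answer>", completion_text, re.DOTALL)
    return match.group(1).strip() if match else None

prompt = PROMPT_TEMPLATE.format(question="What is 6 x 7?")
# A hard-coded completion stands in for the real model call you'd make with `prompt`:
fake_completion = "Let me reason step by step... <answer>42</answer>"
print(extract_answer(fake_completion))  # -> 42
```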

14.35: [Here’s] my favorite example of this though, Ben: So in ChatGPT’s web interface or their application interface, if you go there and you try to prompt an image, a lot of the images that people prompt—and I’ve talked to user research about this—are really boring prompts. They have a text box that can be anything, and they’ll say something like “a black cat” or “a statue of a man thinking.”

OpenAI realized this was leading to a lot of bad images because the prompt wasn’t detailed; it wasn’t a good prompt. So they built a system that recognizes if your prompt is too short, low detail, bad, and it hands it to another model and says, “Improve this prompt,” and it improves the prompt for you. And if you inspect in Chrome or Safari or Firefox, whatever, you inspect the developer settings, you can see the JSON being passed back and forth, and you can see your original prompt going in. Then you can see the improved prompt. 

15.36: My favorite example of this [is] I asked it to make a statue of a man thinking, and it came back and said something like “A detailed statue of a human figure in a thinking pose similar to Rodin’s ‘The Thinker.’ The statue is made of weathered stone sitting on a pedestal. . .” Blah blah blah blah blah blah. A paragraph. . . But below that prompt there were instructions to the chatbot or to the LLM that said, “Generate this image and after you generate the image, do not reply. Do not ask follow-up questions. Do not make any comments describing what you’ve done. Just generate the image.” And in this prompt, nine times, some of them in all caps, they say, “Please do not reply.” And the reason is because a big chunk of OpenAI’s posttraining is teaching these models how to converse back and forth. They want you to always be asking a follow-up question, and they train it to do that. And so now they have to fight the prompts. They have to add in all these statements. And that’s another way that fails. 

16.42: So why I bring this up—and this is why I need to write about it—is as an applied AI developer, you need to recognize when you’re fighting the weights, understand enough about the posttraining of that model, or make some assumptions about it, so that you can stop doing that and try something different, because you’re just banging your head against a wall and you’re going to get inconsistent, bad applications and the same statement 20 times over. 

17.07: By the way, the other thing that’s interesting about this whole topic is, people have somehow underappreciated or forgotten all of the progress we’ve made in information retrieval. There’s a whole. . . I mean, these people have their own conferences, right? Everything from reranking to the actual indexing, even with vector search—the information retrieval community still has a lot to offer, and it’s the kind of thing that people underappreciate. And so by simply loading your context window with massive amounts of garbage, you’re actually leaving so much progress in information retrieval on the field.
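
For a concrete flavor of the IR machinery Ben is pointing at, here’s a minimal sketch of reranking retrieved passages with a cross-encoder before they ever reach the context window, using the sentence-transformers library. The model checkpoint, query, and documents are illustrative.

```python
# Minimal sketch: rerank retrieved passages before stuffing them into the context window.
# Uses the sentence-transformers CrossEncoder API; checkpoint and documents are illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I rotate API keys without downtime?"
retrieved = [
    "Our billing FAQ covers invoice formats and payment methods.",
    "Key rotation: issue a new key, deploy it alongside the old one, then revoke the old key.",
    "The office is closed on public holidays.",
]

# Score each (query, passage) pair and keep only the best few for the prompt.
scores = reranker.predict([(query, passage) for passage in retrieved])
top_passages = [p for _, p in sorted(zip(scores, retrieved), reverse=True)][:2]
print(top_passages)
```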

18.04: I do think it’s hard. And that’s one of the risks: We’re building all this stuff so fast from the ground up, and there’s a tendency to just throw everything into the biggest model possible and then hope it sorts it out.

I really do think there’s two pools of developers. There’s the “throw everything in the model” pool, and then there’s the “I’m going to take incremental steps and find the most optimal model.” And I often find that latter group, which I called a compound AI group after a paper that was published out of Berkeley, those tend to be people who have run data pipelines, because it’s not just a simple back and forth interaction. It’s gigabytes or even more of data you’re processing with the LLM. The costs are high. Latency is important. So designing efficient systems is actually incredibly key, if not a total requirement. So there’s a lot of innovation that comes out of that space because of that kind of boundary.

19.08: If you were to talk to one of these applied AI teams and you were to give them one or two things that they can do right away to improve, or fix context in general, what are some of the best practices?

19.29: Well you’re going to laugh, Ben, because the answer is dependent on the context, and I mean the context in the team and what have you. 

19.38: But if you were to just go give a keynote to a general audience, if you were to list down one, two, or three things that are the lowest hanging fruit, so to speak. . .

19.50: The first thing I’m gonna do is I’m going to look in the room and I’m going to look at the titles of all the people in there, and I’m going to see if they have any subject-matter experts or if it’s just a bunch of engineers trying to build something for subject-matter experts. And my first bit of advice is you need to get yourself a subject-matter expert who is looking at the data, helping you with the eval data, and telling you what “good” looks like. 

I see a lot of teams that don’t have this, and they end up building fairly brittle prompt systems. And then they can’t iterate well, and so that enterprise AI project fails. I also see them not wanting to open themselves up to subject-matter experts, because they want to hold on to the power themselves. It’s not how they’re used to building. 

20.38: I really do think building in applied AI has changed the power dynamic between builders and subject-matter experts. You know, we were talking earlier about some of like the old Web 2.0 days and I’m sure you remember. . . Remember back at the beginning of the iOS app craze, we’d be at a dinner party and someone would find out that you’re capable of building an app, and you would get cornered by some guy who’s like “I’ve got a great idea for an app,” and he would just talk at you—usually a he. 

21.15: This is back in the Objective-C days. . .

21.17: Yes, way back when. And this is someone who loves Objective-C. So you’d get cornered and you’d try to find a way out of that awkward conversation. Nowadays, that dynamic has shifted. The subject-matter expertise is so important for codifying and designing the spec, which usually gets specced out by the evals. And you can even see this. OpenAI is arguably at the forefront of creating this stuff. And what are they doing? They’re standing up programs to get lawyers to come in, to get doctors to come in, to get these specialists to come in and help them create benchmarks because they can’t do it themselves. And so that’s the first thing. Got to work with the subject-matter expert. 

22.04: The second thing is if they’re just starting out—and this is going to sound backwards, given our topic today—I would encourage them to use a system like DSPy or GEPA, which are essentially frameworks for building with AI. And one of the components of that framework is that they optimize the prompt for you with the help of an LLM and your eval data. 

22.37: Throw in BAML?

22.39: BAML is similar [but it’s] more like the spec for how to describe the entire spec. So it’s similar.

22.52: BAML and TextGrad? 

22.55: TextGrad is more like the prompt optimization I’m talking about. 

22:57: TextGrad plus GEPA plus Regolo?

23.02: Yeah, those things are really important. And the reason I say they’re important is. . .

23.08: I mean, Drew, those are kind of advanced topics. 

23.12: I don’t think they’re that advanced. I think they can appear really intimidating because everybody comes in and says, “Well, it’s so easy. I could just write what I want.” And this is the gift and curse of prompts, in my opinion. There’s a lot of things to like about them.

23.33: DSPy is fine, but I think TextGrad, GEPA, and Regolo. . .

23.41: Well. . . I wouldn’t encourage you to use GEPA directly. I would encourage you to use it through the framework of DSPy. 

23.48: The point here is if it’s a team building, you can go down essentially two paths. You can handwrite your prompt, and I think this creates some issues. One is as you build, you tend to have a lot of hotfix statements like, “Oh, there’s a bug over here. We’ll say it over here. Oh, that didn’t fix it. So let’s say it again.” It will encourage you to have one person who really understands this prompt. And so you end up being reliant on this prompt magician. Even though they’re written in English, there’s kind of no syntax highlighting. They get messier and messier as you build the application because they start to grow and become these growing collections of edge cases.

24.27: And the other thing too, and this is really important, is when you build and you spend so much time honing a prompt, you’re doing it against one model, and then at some point there’s going to be a better, cheaper, more effective model. And you’re going to have to go through the process of tweaking it and fixing all the bugs again, because this model functions differently.

And I used to have to try to convince people that this was a problem, but they all kind of found out when OpenAI deprecated all of their models and tried to move everyone over to GPT-5. And now I hear about it all the time. 

25.03: Although I think right now “agents” is our hot topic, right? So we talk to people about agents and you start really getting into the weeds, you realize, “Oh, okay. So their agents are really just prompts.” 

25.16: In the loop. . .

25.19: So agent optimization in many ways means injecting a bit more software engineering rigor in how you maintain and version. . .

25.30: Because that context is growing. As that loop goes, you’re deciding what gets added to it. And so you have to put guardrails in—ways to rescue from failure and figure out all these things. It’s very difficult. And you have to go at it systematically. 

25.46: And then the problem is that, in many situations, the models are not even models that you control, actually. You’re using them through an API like OpenAI or Claude so you don’t actually have access to the weights. So even if you’re one of the super, super advanced teams that can do gradient descent and backprop, you can’t do that. Right? So then, what are your options for being more rigorous in doing optimization?

Well, it’s precisely these tools that Drew alluded to, the TextGrads of the world, the GEPAs. You have these compound systems that are nondifferentiable. So then how do you actually do optimization in a world where you have things that are not differentiable? These are precisely the tools that will allow you to turn it from somewhat of a, I guess, black art to something with a little more discipline. 
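
Here’s a minimal sketch of the prompt-optimization workflow being described, using DSPy’s BootstrapFewShot optimizer (GEPA and other optimizers plug into the same compile step). The signature, metric, training examples, and model identifier are assumptions for illustration.

```python
# Minimal sketch of optimizing a prompt against a small eval set with DSPy, rather than
# handwriting and hotfixing it. Signature, metric, examples, and model ID are illustrative.
import dspy
from dspy.teleprompt import BootstrapFewShot

# Recent DSPy versions; the model identifier here is an assumption.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class TriageTicket(dspy.Signature):
    """Classify a support ticket as one of: billing, bug, how_to."""
    ticket = dspy.InputField()
    label = dspy.OutputField()

classify = dspy.Predict(TriageTicket)

trainset = [
    dspy.Example(ticket="I was charged twice this month.", label="billing").with_inputs("ticket"),
    dspy.Example(ticket="The export button crashes the app.", label="bug").with_inputs("ticket"),
    dspy.Example(ticket="How do I change my avatar?", label="how_to").with_inputs("ticket"),
]

def exact_match(example, prediction, trace=None):
    # Metric the optimizer (and your evals) use to judge a candidate prompt.
    return example.label == prediction.label

# The optimizer augments the prompt (here by bootstrapping few-shot demos) using your
# metric and data, instead of you hand-tuning statements against one specific model.
optimizer = BootstrapFewShot(metric=exact_match)
optimized_classify = optimizer.compile(classify, trainset=trainset)

print(optimized_classify(ticket="Why does my invoice show two charges?").label)
```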

26.53: And I think a good example is, even if you aren’t going to use prompt optimization-type tools. . . The prompt optimization is a great solution for what you just described, which is when you can’t control the weights of the models you’re using. But the other thing too, is, even if you aren’t going to adopt that, you need to get evals because that’s going to be step one for anything, which is you need to start working with subject-matter experts to create evals.

27.22: Because what I see. . . And there was just a really dumb argument online of “Are evals worth it or not?” And it was really silly to me because it was positioned as an either-or argument. And there were people arguing against evals, which is just insane to me. And the reason they were arguing against evals is they’re basically arguing in favor of what they called, to your point about dark arts, vibe shipping—which is they’d make changes, push those changes, and then the person who was also making the changes would go in and type in 12 different things and say, “Yep, feels right to me.” And that’s insane to me. 

27.57: And even if you’re doing that—which I think is a good thing, and you may not go create full coverage and evals because you have some taste. . . And I do think when you’re building more qualitative tools. . . So a good example is if you’re Character.AI or you’re Portola Labs, who’s building essentially personalized emotional chatbots, it’s going to be harder to create evals, and it’s going to require taste as you build them. But having evals is going to ensure that your whole thing didn’t fall apart because you changed one sentence, which sadly is a risk because this is probabilistic software.

28.33: Honestly, evals are super important. Number one because, basically, leaderboards like LMArena are great for narrowing your options, but at the end of the day, you still need to benchmark all of these against your own application, use case, and domain. And then secondly, obviously, it’s an ongoing thing. So it ties in with reliability: If your application is reliable, that most likely means you’re doing evals properly in an ongoing fashion. And I really believe that evals and reliability are a moat, because basically what else is your moat? A prompt? That’s not a moat. 
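
As a minimal sketch of what “having evals” can look like in practice, here’s a tiny harness: a labeled set, a scoring rule, and a pass rate you track across every prompt or model change. The task, examples, and `call_model` stand-in are assumptions, not a real pipeline.

```python
# Minimal sketch of an eval harness: labeled cases, a scoring rule, and a tracked pass rate.
# The task, examples, and call_model stand-in are illustrative assumptions.

EVAL_SET = [
    {"input": "Refund me, I was double charged", "expected": "billing"},
    {"input": "App crashes when I tap export", "expected": "bug"},
    {"input": "Where do I change my avatar?", "expected": "how_to"},
]

def call_model(text: str) -> str:
    """Stand-in for the real LLM call (prompt + model under test)."""
    # Replace with your actual system; a keyword heuristic keeps the sketch runnable.
    lowered = text.lower()
    if "charge" in lowered or "refund" in lowered:
        return "billing"
    if "crash" in lowered or "error" in lowered:
        return "bug"
    return "how_to"

def run_evals() -> float:
    passed = 0
    for case in EVAL_SET:
        got = call_model(case["input"]).strip().lower()
        if got == case["expected"]:
            passed += 1
        else:
            print("FAIL:", case["input"], "expected", case["expected"], "got", got)
    return passed / len(EVAL_SET)

# Run this on every prompt or model change and watch the pass rate, not vibes.
print(f"pass rate: {run_evals():.0%}")
```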

29.21: So first off, violent agreement there. The only asset teams truly have—unless they’re a model builder, which is only a handful—is their eval data. And I would say the counterpart to that is their spec, whatever defines their program, but mostly the eval data. But to the other point about it, like why are people vibe shipping? I think you can get pretty far with vibe shipping and it fools you into thinking that that’s right.

We saw this pattern in the Web 2.0 and social era, which was, you would have the product genius—everybody wanted to be the Steve Jobs, who didn’t hold focus groups, didn’t ask their customers what they wanted. The Henry Ford quote about “They all say faster horses,” and I’m the genius who comes in and tweaks these things and ships them. And that often takes you very far.

30.13: I also think it’s a bias of success. We only know about the ones that succeed, but the best ones, when they grow up and they start to serve an audience that’s way bigger than what they could hold in their head, they start to grow up with AB testing and ABX testing throughout their organization. And a good example of that is Facebook.

Facebook stopped being just some choices and started having to do testing and ABX testing in every aspect of their business. Compare that to Snap, which again, was kind of the last of the great product geniuses to come out. Evan [Spiegel] was heralded as “He’s the product genius,” but I think they ran that too long, and they kept shipping on vibes rather than shipping on ABX testing and growing and, you know, being more boring.

31.04: But again, that’s how you get the global reach. I think there’s a lot of people who probably are really great vibe shippers. And they’re probably having great success doing that. The question is, as their company grows and starts to hit harder times or the growth starts to slow, can that vibe shipping take them over the hump? And I would argue, no, I think you have to grow up and start to have more accountable metrics that, you know, scale to the size of your audience. 

31.34: So in closing. . . We talked about prompt engineering. And then we talked about context engineering. So putting you on the spot. What’s a buzzword out there that either irks you or you think is undertalked about at this point? So what’s a buzzword out there, Drew? 

31.57: [laughs] I mean, I wish you had given me some time to think about it. 

31.58: We are in a hype cycle here. . .

32.02: We’re always in a hype cycle. I don’t like anthropomorphizing LLMs or AI for a whole host of reasons. One, I think it leads to bad understanding and bad mental models, which means that we don’t have substantive conversations about these things, and we don’t learn how to build really well with them because we think they’re intelligent. We think they’re a PhD in your pocket. We think they’re all of these things and they’re not—they’re fundamentally different. 

I’m not against using the way we think the brain works for inspiration. That’s fine with me. But when you start oversimplifying these and not taking the time to explain to your audience how they actually work—you just say it’s a PhD in your pocket, and here’s the benchmark to prove it—you’re misleading and setting unrealistic expectations. And unfortunately, the market rewards them for that. So they keep going. 

But I also think it just doesn’t help you build sustainable programs because you aren’t actually understanding how it works. You’re just kind of reducing it down to it. AGI is one of those things. And superintelligence, but AGI especially.

33.21: I went to school at UC Santa Cruz, and one of my favorite classes I ever took was a seminar with Donna Haraway. Donna Haraway wrote “A Cyborg Manifesto” in the ’80s. She looks at tech and science history through a feminist lens. You would just sit in that class and your mind would explode, and then at the end, you’d just have to sit there for five minutes afterwards, picking up the pieces. 

She had a great term called “power objects.” A power object is something that we as a society recognize to be incredibly important, believe to be incredibly important, but we don’t know how it works. That lack of understanding allows us to fill this bucket with whatever we want it to be: our hopes, our fears, our dreams. This happened with DNA; this happened with PET scans and brain scans. This happens all throughout science history, down to phrenology and blood types and things that we understand to be, or we believed to be, important, but they’re not. And big data, another one that is very, very relevant. 

34.34: That’s my handle on Twitter. 

34.55: Yeah, there you go. So like it’s, you know, I fill it with Ben Lorica. That’s how I fill that power object. But AI is definitely that. AI is definitely that. And my favorite example of this is when the DeepSeek moment happened, we understood this to be really important, but we didn’t understand why it works and how well it worked.

And so what happened is, if you looked at the news and you looked at people’s reactions to what DeepSeek meant, you could basically find all the hopes and dreams about whatever was important to that person. So to AI boosters, DeepSeek proved that LLM progress is not slowing down. To AI skeptics, DeepSeek proved that AI companies have no moat. To open source advocates, it proved open is superior. To AI doomers, it proved that we aren’t being careful enough. Security researchers worried about the risk of backdoors in the models because it was in China. Privacy advocates worried about DeepSeek’s web services collecting sensitive data. China hawks said, “We need more sanctions.” Doves said, “Sanctions don’t work.” NVIDIA bears said, “We’re not going to need any more data centers if it’s going to be this efficient.” And bulls said, “No, we’re going to need tons of them because it’s going to use everything.”

35.44: And AGI is another term like that, which means everything and nothing. And when the point comes that we’ve supposedly reached it, it isn’t that. And compounding that is that it’s in the contract between OpenAI and Microsoft—I forget the exact term, but it’s the statement that Microsoft gets access to OpenAI’s technologies until AGI is achieved.

And so it’s a very loaded definition right now that’s being debated back and forth and trying to figure out how to take [Open]AI into being a for-profit corporation. And Microsoft has a lot of leverage because how do you define AGI? Are we going to go to court to define what AGI is? I almost look forward to that.

36.28: So it’s going to be that thing. And you’ve seen Sam Altman come out: Some days he talks about how LLMs are just software; some days he talks about how it’s a PhD in your pocket; some days he talks about how we’ve already passed AGI, it’s already over. 

I think Nathan Lambert has some great writing about how AGI is a mistake. We shouldn’t talk about trying to turn LLMs into humans. We should try to leverage what they do now, which is something fundamentally different, and we should keep building and leaning into that rather than trying to make them like us. So AGI is my word for you. 

37.03: The way I think of it is, AGI is great for fundraising, let’s put it that way. 

37.08: That’s basically it. Well, until you need it to have already been achieved, or until you need it to not be achieved because you don’t want any regulation or if you want regulation—it’s kind of a fuzzy word. And that has some really good properties. 

37.23: So I’ll close by throwing in my own term. So prompt engineering, context engineering. . . I will close by saying pay attention to this boring term, which my friend Ion Stoica is now talking more about “systems engineering.” If you look at particularly the agentic applications, you’re talking about systems.

37.55: Can I add one thing to this? Violent agreement. I think that is an underrated. . . 

38.00: Although I think it’s too boring a term, Drew, to take off.

38.03: That’s fine! The reason I like it—and you were talking about this when you talk about fine-tuning—is that, looking at the way people build, and looking at the way I see teams with success build, there’s pretraining, where you’re basically training on unstructured data and you’re just building your base knowledge, your base English capabilities, and all that. And then you have posttraining. And in general, posttraining is where you build. I do think of it as a form of interface design, even though you are adding new skills: You’re teaching reasoning, you’re teaching it validated functions like code and math. You’re teaching it how to chat with you. This is where it learns to converse. You’re teaching it how to use tools and specific sets of tools. And then you’re teaching it alignment, what’s safe, what’s not safe, all these other things. 

But then after it ships, you can still RL that model, you can still fine-tune that model, and you can still prompt engineer that model, and you can still context engineer that model. And back to the systems engineering thing: I think we’re going to see that posttraining carry all the way through to a final applied AI product. That’s going to be a real shades-of-gray gradient. And this is one of the reasons why I think open models have a pretty big advantage in the future: You’re going to dip down all the way throughout that and leverage that. . .

39.32: The only thing that’s keeping us from doing that now is we don’t have the tools and the operating system to align everything from that posttraining through to shipping. Once we do, that operating system is going to change how we build, because the distance between posttraining and building is going to look really, really, really blurry. I really like the systems engineering type of approach, but I also think you can start to see this: Yesterday Thinking Machines released their first product.

40.04: And so Thinking Machines is Mira [Murati]’s very hyped company. They launched their first thing, and it’s called Tinker. And it’s essentially, “Hey, you can write very simple Python code, and then we will do the RL for you or the fine-tuning for you using our cluster of GPUs so you don’t have to manage that.” And that is the type of thing that we want to see in a maturing kind of development framework. And you start to see this operating system emerging. 

And it reminds me of the early days of O’Reilly, where it’s like I had to stand up a web server, I had to maintain a web server, I had to do all of these things, and now I don’t have to. I can spin up a Docker image, I can ship to Render, I can ship to Vercel. All of these shared complicated things now have frameworks and tooling, and I think we’re going to see a similar evolution from that. And I’m really excited. And I think you have picked a great underrated term. 

40.56: Now with that. Thank you, Drew. 

40.58: Awesome. Thank you for having me, Ben.

Generative AI in the Real World: Faye Zhang on Using AI to Improve Discovery

18 September 2025 at 06:12

In this episode, Ben Lorica and AI engineer Faye Zhang talk about discoverability: how to use AI to build search and recommendation engines that actually find what you want. Listen in to learn how AI goes way beyond simple collaborative filtering—pulling in many different kinds of data and metadata, including images and voice, to get a much better picture of what any object is and whether or not it’s something the user would want.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

0:00: Today we have Faye Zhang of Pinterest, where she’s a staff AI engineer. And so with that, very welcome to the podcast.

0:14: Thanks, Ben. Huge fan of the work. I’ve been fortunate to attend both the Ray and NLP Summits. I know where you serve as chairs. I also love the O’Reilly AI podcast. The recent episode on A2A and the one with Raiza Martin on NotebookLM have been really inspirational. So, great to be here. 

0:33: All right, so let’s jump right in. So one of the first things I really wanted to talk to you about is this work around PinLanding. And you’ve published papers, but I guess at a high level, Faye, maybe describe for our listeners: What problem is PinLanding trying to address?

0:53: Yeah, that’s a great question. In short, we’re trying to solve this trillion-dollar discovery crisis. We’re living through the greatest paradox of the digital economy. Essentially, there’s infinite inventory but very little discoverability. Picture one example: A bride-to-be asks ChatGPT, “Now, find me a wedding dress for an Italian summer vineyard ceremony,” and she gets great general advice. But meanwhile, somewhere in Nordstrom’s hundreds of catalogs, there sits the perfect terracotta Soul Committee dress, never to be found. And that’s a $1,000 sale that will never happen. And if you multiply this by a billion searches across Google, SearchGPT, and Perplexity, we’re talking about a $6.5 trillion market, according to Shopify’s projections, where every failed product discovery is money left on the table. So that’s what we’re trying to solve—essentially the semantic organization of platform content to match user context and search. 

2:05: So, before PinLanding was developed, and if you look across the industry and other companies, what would be the default—what would be the incumbent system? And what would be insufficient about this incumbent system?

2:22: There have been researchers working on this problem across the past decade; we’re definitely not the first ones. I think number one is to understand catalog attribution. So, back in the day, there were multitask R-CNN models, as we remember, [that could] identify fashion shopping attributes. So you would pass an image into the system, and it would identify, OK, this shirt is red and the material may be silk. And then, in recent years, because of the leverage of large-scale VLMs (vision language models), this problem has become much easier. 

3:03: And then I think the second route that people come at this from is via the content organization itself. Back in the day, [there was] research on joint graph modeling over shared similarity of attributes. And a lot of ecommerce stores also do, “Hey, if people like this, you might also like that,” and that relationship graph gets captured in their organization tree as well. We utilize a vision large language model and the foundation model CLIP by OpenAI to easily recognize what this content or piece of clothing could be for. And then we use LLMs to discover all possibilities—like scenarios, use cases, price points—to connect the two worlds together. 
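
Here’s a rough sketch of the kind of CLIP-based zero-shot attribute scoring Faye describes, using the Hugging Face transformers API. The checkpoint, candidate labels, and image path are placeholders; this is not Pinterest’s actual pipeline.

```python
# Minimal sketch: score candidate fashion attributes for an image with OpenAI's CLIP
# via Hugging Face transformers. Labels, checkpoint, and image path are illustrative.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dress.jpg")  # placeholder image path
candidate_labels = [
    "a red silk dress",
    "a terracotta linen dress for a summer wedding",
    "a black wool winter coat",
]

inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # image-to-text similarity as probabilities

for label, prob in zip(candidate_labels, probs[0].tolist()):
    print(f"{prob:.2f}  {label}")
```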

3:55: To me that implies you have some rigorous eval process or even a separate team doing eval. Can you describe to us at a high level what is eval like for a system like this? 

4:11: Definitely. I think there are internal and external benchmarks. For the external ones, it’s Fashion200K, which is a public benchmark anyone can download from Hugging Face, a standard for how accurate your model is at predicting fashion items. So we measure the performance using recall top-k metrics, which say whether the correct label appears among the top-k predicted attributes, and as a result, we were able to see 99.7% recall for the top ten.
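
For reference, here’s a minimal sketch of the recall@k calculation being described: for each item, check whether the true attribute label lands in the model’s top-k predictions. The predictions and labels below are made up.

```python
# Minimal sketch of recall@k for attribute prediction: did the true label land in the top-k?
# The ranked predictions and ground-truth labels below are made-up examples.

def recall_at_k(ranked_predictions: list[list[str]], true_labels: list[str], k: int) -> float:
    hits = sum(1 for preds, label in zip(ranked_predictions, true_labels) if label in preds[:k])
    return hits / len(true_labels)

ranked_predictions = [
    ["red", "maroon", "pink", "orange"],    # model's ranked attributes for item 1
    ["silk", "satin", "cotton", "linen"],   # ... for item 2
]
true_labels = ["red", "linen"]

print(recall_at_k(ranked_predictions, true_labels, k=3))  # -> 0.5
```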

4:47: The other topic I wanted to talk to you about is recommendation systems. So obviously there’s now talk about, “Hey, maybe we can go beyond correlation and go towards reasoning.” Can you [tell] our audience, who may not be steeped in state-of-the-art recommendation systems, how you would describe the state of recommenders these days?

5:23: For the past decade, [we’ve been] seeing tremendous movement and foundational shifts in how RecSys essentially operates. Just to call out a few big themes I’m seeing across the board: Number one, it’s moving from correlation to causation. Back then it was, hey, a user who likes X might also like Y. But now we actually understand why pieces of content are connected semantically, and our LLM-based models are able to reason about user preferences and what they actually are. 

5:58: The second big theme is probably the cold start problem, where companies leverage semantic IDs to solve the new-item problem by encoding an understanding of the content directly. For example, if this is a dress, then you understand its color, style, theme, etc. 
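
To make the semantic-ID idea concrete, here’s a hedged sketch that quantizes content embeddings into short discrete codes so a brand-new item gets a meaningful ID before anyone has interacted with it. Production systems typically use learned quantizers such as RQ-VAE; the random embeddings and k-means codebooks below are stand-ins, and all names are hypothetical.

```python
# Hedged sketch: derive a discrete "semantic ID" from an item's content embedding
# so a cold-start item gets a meaningful ID with zero interaction history.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(1000, 64))   # stand-in for content embeddings (text/image)

# Two k-means codebooks over the two halves of the vector: a crude stand-in for
# the learned residual quantizers (e.g., RQ-VAE) used in real systems.
cb1 = KMeans(n_clusters=32, n_init=10, random_state=0).fit(item_embeddings[:, :32])
cb2 = KMeans(n_clusters=32, n_init=10, random_state=0).fit(item_embeddings[:, 32:])

def semantic_id(embedding: np.ndarray) -> tuple[int, int]:
    """Map a 64-dim content embedding to a coarse, content-derived code like (17, 4)."""
    return (int(cb1.predict(embedding[None, :32])[0]),
            int(cb2.predict(embedding[None, 32:])[0]))

new_item = rng.normal(size=64)                  # a new item: no clicks yet
print(semantic_id(new_item))                    # shares code prefixes with similar content
```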

6:17: And I think [there are] other big themes we’re seeing; for example, Netflix is moving from isolated systems toward unified intelligence. Just this past year, Netflix updated their multitask architecture with shared representations into one they call the UniCoRn system, to enable company-wide improvements and optimizations. 

6:44: And very lastly, I think on the frontier side—this is actually what I learned at the AI Engineer Summit from YouTube. It’s a DeepMind collaboration, where YouTube is now using a large recommendation model, essentially teaching Gemini to speak the language of YouTube: of, hey, a user watched this video, then what might [they] watch next? So a lot of very exciting capabilities happening across the board for sure. 

7:15: Generally it sounds like the themes from years past still map over in the following sense, right? So there’s content—the difference being now you have these foundation models that can understand the content that you have more granularly. It can go deep into the videos and understand, hey, this video is similar to this video. And then the other source of signal is behavior. So those are still the two main buckets?

7:53: Correct. Yes, I would say so. 

7:55: And so the foundation models help you on the content side but not necessarily on the behavior side?

8:03: I think it depends on how you want to see it. For example, on the embedding side, which is a kind of representation of a user entity, there have been transformations [since] back in the day with the BERT Transformer. Now you’ve got long-context encapsulation, and those are all with the help of LLMs. So we can better understand users, not just by their next or last clicks, but by “hey, [in the] next 30 days, what might a user like?” 

8:31: I’m not sure this is happening, so correct me if I’m wrong. The other thing that I would imagine that the foundation models can help with is, I think for some of these systems—like YouTube, for example, or maybe Netflix is a better example—thumbnails are important, right? The fact now that you have these models that can generate multiple variants of a thumbnail on the fly means you can run more experiments to figure out user preferences and user tastes, correct? 

9:05: Yes. I would say so. I was lucky enough to be invited to one of the engineer network dinners, [and was] speaking with the engineer who actually works on the thumbnails. Apparently it was all personalized, and the approach you mentioned enabled their rapid iteration of experiments, and had definitely yielded very positive results for them. 

9:29: For the listeners who don’t work on recommendation systems, what are some general lessons from recommendation systems that generally map to other forms of ML and AI applications? 

9:44: Yeah, that’s a great question. A lot of the concepts still apply. For example, the knowledge distillation. I know Indeed was trying to tackle this. 

9:56: Maybe Faye, first define what you mean by that, in case listeners don’t know what that is. 

10:02: Yes. So knowledge distillation is essentially, from a modeling standpoint, learning from a parent model with larger parameters and better world knowledge (and the same idea applies to ML systems) and distilling it into smaller models that can operate much faster but still, hopefully, encapsulate the learning of the parent model. 

10:24: So I think what Indeed faced back then was the classic precision-versus-recall trade-off in production ML. Their binary classifier needs to filter the batch of jobs you would recommend to candidates. But this process is obviously very noisy, and sparse training data can cause latency and other constraints. In the work they published, they couldn’t really separate résumé content effectively with Mistral and maybe Llama 2. And then they were happy to learn [that] out-of-the-box GPT-4 achieved something like 90% precision and recall. But obviously GPT-4 is more expensive and has close to 30 seconds of inference time, which is much slower.

11:21: So I think what they did was use the distillation concept to fine-tune GPT-3.5 on labeled data, and then distill it into a lightweight BERT-based model using a temperature-scaled softmax, and they were able to achieve millisecond latency with a comparable recall-precision trade-off. So I think that’s one of the learnings we see across the industry: Traditional ML techniques still work in the age of AI. And I think we’re going to see a lot more of this in production work as well. 
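
For reference, here’s a minimal sketch of the temperature-scaled softmax distillation loss mentioned above, assuming PyTorch; the teacher and student tensors are toy placeholders, not Indeed’s actual models or pipeline.

```python
# Hedged sketch: distill a larger "teacher" classifier into a small "student"
# using temperature-scaled soft targets plus the usual hard-label loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: teacher and student distributions are softened by temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage with random logits for a binary classifier.
student = torch.randn(8, 2, requires_grad=True)
teacher = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```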

11:57: By the way, one of the underappreciated things in the recommendation system space is actually UX in some ways, right? Because basically good UX for delivering the recommendations actually can move the needle. How you actually present your recommendations might make a material difference.  

12:24: I think that’s very much true, although I can’t claim to be an expert on it, because I know most recommendation systems deal with monetization, so it’s tricky to weigh “Hey, what does my user click on, engage with, send via social” versus what percentage of that…

12:42: And it’s also very platform specific. So you can imagine TikTok as one single feed—the recommendation is just on the feed. But YouTube is, you know, the stuff on the side or whatever. And then Amazon is something else. Spotify and Apple [too]. Apple Podcast is something else. But in each case, I think those of us on the outside underappreciate how much these companies invest in the actual interface.

13:18: Yes. And I think there are multiple iterations happening on any day, [so] you might see a different interface than your friends or family because you’re actually being grouped into A/B tests. I think this is very much true of [how] the engagement and performance of the UX have an impact on a lot of the search/rec system as well, beyond the data we just talked about. 

13:41: Which brings to mind another topic that is also something I’ve been interested in, over many, many years, which is this notion of experimentation. Many of the most successful companies in the space actually have invested in experimentation tools and experimentation platforms, where people can run experiments at scale. And those experiments can be done much more easily and can be monitored in a much more principled way so that any kind of things they do are backed by data. So I think that companies underappreciate the importance of investing in such a platform. 

14:28: I think that’s very much true. A lot of larger companies actually build their own in-house A/B testing experiment or testing frameworks. Meta does; Google has their own and even within different cohorts of products, if you’re monetization, social. . . They have their own niche experimentation platform. So I think that thesis is very much true. 

14:51: The last topic I wanted to talk to you about is context engineering. I’ve talked to numerous people about this. So every six months, the context window for these large language models expands. But obviously you can’t just stuff the context window full, because one, it’s inefficient. And two, actually, the LLM can still make mistakes because it’s not going to efficiently process that entire context window anyway. So talk to our listeners about this emerging area called context engineering. And how is that playing out in your own work? 

15:38: I think this is a fascinating topic, where you will hear people passionately say, “RAG is dead.” And it’s really, as you mentioned, [that] our context window gets much, much bigger. Like, for example, back in April, Llama 4 had this staggering 10 million token context window. So the logic behind this argument is quite simple. Like if the model can indeed handle millions of tokens, why not just dump everything instead of doing a retrieval?

16:08: I think there are quite a few fundamental limitations to this. I know folks from Contextual AI are passionate about this. I think number one is scalability. A lot of times in production, at least, your knowledge base is measured in terabytes or petabytes, not tokens. So something even larger. And number two, I think, would be accuracy.

16:33: The effective context window is very different, honestly, from what is advertised in product launches. We see performance degrade long before the model reaches its “official limits.” And then I think number three is probably efficiency, and that kind of aligns with our human behavior as well: Do you read an entire book every time you need to answer one simple question? So context engineering has slowly evolved from a buzzword a few years ago to an engineering discipline now. 

17:15: I’m appreciative that the context windows are increasing. But at some level, I also acknowledge that, to some extent, it’s a feel-good move on the part of the model builders. It makes us feel good that we can put more things in there, but it may not actually help us answer the question precisely. Actually, a few years ago, I wrote a kind of tongue-in-cheek post called “Structure Is All You Need.” Basically, whatever structure you have, you should use it to help the model, right? If it’s in a SQL database, then maybe you can expose the structure of the data. If it’s a knowledge graph, you leverage whatever structure you have to give the model better context. So this whole notion of just stuffing the model with as much information as possible is questionable for all the reasons you gave. But also, philosophically, it doesn’t make any sense to do that anyway.

18:30: What are the things that you are looking forward to, Faye, in terms of foundation models? What kinds of developments in the foundation model space are you hoping for? And are there any developments that you think are below the radar? 

18:52: I think, to better utilize the concept of “context engineering,” there are essentially two loops. Number one is the inner loop: what happens within the LLMs. And then there’s the outer loop: what you can do as an engineer to optimize a given context window, etc., to get the best results out of the product. Within the context loop, there are multiple tricks we can do: For example, there’s vector plus lexical or regex extraction. There are metadata filters. And then for the outer loop—this is a very common practice—people are using LLMs as a reranker, sometimes a cross-encoder. So the thesis is, hey, why would you overburden an LLM with ranking 20,000 candidates when there are things you can do to reduce it to the top hundred or so? So all of this—context assembly, deduplication, and diversification—helps our production systems go from a prototype to something [that’s] more real time, reliable, and able to scale. 
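
Here’s a hedged sketch of that outer-loop pipeline: retrieve broadly, filter on metadata, deduplicate, and hand only a small top-k context to the LLM. The toy documents, the character-histogram embedding, and the function names are illustrative stand-ins for real retrieval components and rerankers.

```python
# Hedged sketch of the "outer loop": assemble a compact, high-quality context
# instead of dumping an entire catalog into the LLM's context window.
import numpy as np

docs = [
    {"id": 1, "text": "Terracotta silk dress, vineyard wedding", "category": "dress"},
    {"id": 2, "text": "Terracotta silk dress, vineyard wedding", "category": "dress"},  # duplicate
    {"id": 3, "text": "Blue denim jacket, casual weekend",       "category": "jacket"},
    {"id": 4, "text": "Linen summer dress, outdoor ceremony",    "category": "dress"},
]

def embed(text: str) -> np.ndarray:
    # Toy character-histogram embedding; swap in a real embedding model in practice.
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec / (np.linalg.norm(vec) + 1e-9)

def assemble_context(query: str, category: str, k: int = 2) -> list[dict]:
    q = embed(query)
    # Metadata filter first, then similarity scoring, then dedup, then top-k.
    candidates = [d for d in docs if d["category"] == category]
    ranked = sorted(candidates, key=lambda d: float(q @ embed(d["text"])), reverse=True)
    seen, top = set(), []
    for d in ranked:
        if d["text"] not in seen:
            seen.add(d["text"])
            top.append(d)
        if len(top) == k:
            break
    return top

context = assemble_context("dress for an Italian summer vineyard ceremony", category="dress")
prompt = "Answer using only this context:\n" + "\n".join(d["text"] for d in context)
print(prompt)  # a compact context block instead of the whole catalog
```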

20:07: One of the things I wish—and I don’t know, this may be wishful thinking—is for the models to be a little more predictable. By that, I mean, if I ask a question in two different ways, it’ll basically give me the same answer. I wish the foundation model builders could somehow increase predictability and maybe provide us with a little more explanation for how they arrive at an answer. I understand they’re giving us the tokens, and maybe some of the reasoning models are a little more transparent, but give us an idea of how these things work, because it’ll impact what kinds of applications we’d be comfortable deploying these things in. For example, agents: If I’m using an agent to use a bunch of tools, but I can’t really predict its behavior, that impacts the types of applications I’d be comfortable using a model for. 

21:18: Yeah, definitely. I very much resonate with this, especially now that most engineers have AI-empowered coding tools like Cursor and Windsurf—and as an individual, I very much appreciate the train of thought you mentioned: why an agent does certain things. Why is it navigating between repositories? What is it looking at while it’s making this call? I think these things are very much appreciated. I know there are other approaches—look at Devin, the fully autonomous engineering peer. It just takes things, and you don’t know where it goes. But I think in the near future there will be a nice marriage between the two, especially now that Windsurf is part of Devin’s parent company. 

22:05: And with that, thank you, Faye.

22:08: Awesome. Thank you, Ben.

Generative AI in the Real World: Luke Wroblewski on When Databases Talk Agent-Speak

4 September 2025 at 12:01

Join Luke Wroblewski and Ben Lorica as they talk about the future of software development. What happens when we have databases that are designed to interact with agents and language models rather than humans? We’re starting to see what that world will look like. It’s an exciting time to be a software developer.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Timestamps

  • 0:00: Introduction to Luke Wroblewski of Sutter Hill Ventures. 
  • 0:36: You’ve talked about a paradigm shift in how we write applications. You’ve said that all we need is a URL and model, and that’s an app. Has anyone else made a similar observation? Have you noticed substantial apps that look like this?
  • 1:08: The future is here; it’s just not evenly distributed yet. That’s what everyone loves to say. The first websites looked nothing like robust web applications, and now we have a multimedia podcast studio running in the browser. We’re at the phase where some of these things look and feel less robust. And our ideas for what constitutes an application change in each of these phases. If I told you pre-Google Maps that we’d be running all of our web applications in a browser, you’d have laughed at me. 
  • 2:13: I think what you mean is an MCP server, and the model itself is the application, correct?
  • 2:24: Yes. The current definition of an application, in a simple form, is running code and a database. We’re at the stage where you have AI coding agents that can handle the coding part. But we haven’t really had databases that have been designed for the way those agents think about code and interacting with data.
  • 2:57: Now that we have databases that work the way agents work, you can take out the running-code part almost. People go to Lovable or Cursor and they’re forced to look at code syntax. But if an AI model can just use a database effectively, it takes the role of the running code. And if it can manage data visualizations and UI, you don’t need to touch the code. You just need to point the AI at a data structure it can use effectively. MCP UI is a nice example of people pushing in this direction.
  • 4:12: Which brings us to something you announced recently: AgentDB. You can find it at agentdb.dev. What problem is AgentDB trying to solve?
  • 4:34: Related to what we were just talking about: How do we get AI agents to use databases effectively? Most things in the technology stack are made for humans and the scale at which humans operate.
  • 5:06: They’re still designed for a DBA, but eliminating the command line, right? So you still have to have an understanding of DBA principles?
  • 5:19: How do you pick between the different compute options? How do you pick a region? What are the security options? And it’s not something you’re going to do thousands of times a day. Databricks just shared some stats where they said that thousands of databases per agent get made a day. They think 99% of databases being made are going to be made by agents. What is making all these databases? No longer humans. And the scale at which they make them—thousands is a lowball number. It will be way, way higher than that. How do we make a database system that works in that reality?
  • 6:22: So the high-level thesis here is that lots of people will be creating agents, and these agents will rely on something that looks like a database, and many of these people won’t be hardcore engineers. What else?
  • 6:45: It’s also agents creating agents, and agents creating applications, and agents deciding they need a database to complete a task. The explosion of these smart machine uses and workflows is well underway. But we don’t have an infrastructure that was made for that world. They were all designed to work with humans.
  • 7:31: So in the classic database world, you’d consider AgentDB more like OLTP rather than analytics and OLAP.
  • 7:42: Yeah, for analytics you’d probably stick your log somewhere else. The characteristics that make AgentDB really interesting for agents is, number 1: To create a database, all you really need is a unique ID. The creation of the ID manifests a database out of thin air. And we store it as a file, so you can scale like crazy. And all of these databases are fully isolated. They’re also downloadable, deletable, releasable—all the characteristics of a filesystem. We also have the concept of a template that comes along with the database. That gives the AI model or agent all the context it needs to start using the database immediately. (See the sketch after these timestamps.) If you just point Claude at a database, it will need to look at the structure (schema). It will burn tokens and time trying to get the structure of the information. And every time it does this is an opportunity to make a mistake. With AgentDB, when an agent or an AI model is pointed at the database with a template, it can immediately write a query because we have in there a description of the database, the schema. So you save time, cut down errors, and don’t have to go through that learning step every time the model touches a database.
  • 10:22: I assume this database will have some of the features you like, like ACID, vector search. So what kinds of applications have people built using AgentDB? 
  • 10:53: We put up a little demo page where we allow you to start the process with a CSV file. You upload it, and it will create the database and give you an MCP URL. So people are doing things like personal finance. People are uploading their credit card statements, their bank statements, because those applications are horrendous.
  • 11:39: So it’s the actual statement; it parses it?
  • 11:45: Another example: Someone has a spreadsheet to track jobs. They can take that, upload it, it gives them a template and a database and an MCP URL. They can pop that job-tracking database into Claude and do all the things you can do with a chat app, like ask, “What did I look at most recently?”
  • 12:35: Do you envision it more like a DuckDB, more embedded, not really intended for really heavy transactional, high-throughput, more-than-one-table complicated schemas?
  • 12:49: We currently support DuckDB and SQLite. But there are a bunch of folks who have made multiple table apps and databases.
  • 13:09: So it’s not meant for you to build your own CRM?
  • 13:18: Actually, one of our go-to-market guys had data of people visiting the website. He can dump that as a spreadsheet. He has data of people starring repos on GitHub. He has data of people who reached out through this form. He has all of these inbound signals of customers. So he took those, dropped them in as CSV files, put it in Claude, and then he can say, “Look at these, search the web for information about these, add it to the database, sort it by priority, assign it to different reps.” It’s CRM-ish already, but super-customized to his particular use case. 
  • 14:27: So you can create basically an agentic Airtable.
  • 14:38: This means if you’re building AI applications or databases—traditionally that has been somewhat painful. This removes all that friction.
  • 15:00: Yes, and it leads to a different way of making apps. You take that CSV file, you take that MCP URL, and you have a chat app.
  • 15:17: Even though it’s accessible to regular users, it’s something developers should consider, right?
  • 15:25: We’re starting to see emergent end-user use cases, but what we put out there is for developers. 
  • 15:38: One of the other things you’ve talked about is the notion that software development has flipped. Can you explain that to our listeners?
  • 15:56: I spent eight and a half years at Google, four and a half at Yahoo, two and a half at eBay, and your traditional process of what we’re going to do next is up front: There’s a lot of drawing pictures and stuff. We had to scope engineering time. A lot of the stuff was front-loaded to figure out what we were going to build. Now with things like AI agents, you can build it and then start thinking about how it integrates inside the project. At a lot of our companies that are working with AI coding agents, I think this naturally starts to happen, that there’s a manifestation of the technology that helps you think through what the design should be, how do we integrate into the product, should we launch this? This is what I mean by “flipped.”
  • 17:41: If I’m in a company like a big bank, does this mean that engineers are running ahead?
  • 17:55: I don’t know if it’s happening in big banks yet, but it’s definitely happening in startup companies. And design teams have to think through “Here’s a bunch of stuff, let me do a wash across all that to fit in,” as opposed to spending time designing it earlier. There are pros and cons to both of these. The engineers were cleaning up the details in the previous world. Now the opposite is true: I’ve built it, now I need to design it.
  • 18:55: Does this imply a new role? There’s a new skill set that designers have to develop?
  • 19:07: There’s been this debate about “Should designers code?” Over the years lots of things have reduced the barrier to entry, and now we have an even more dramatic reduction. I’ve always been of the mindset that if you understand the medium, you will make better things. Now there’s even less of a reason not to do it.
  • 19:50: Anecdotally, what I’m observing is that the people who come from product are able to build something, but I haven’t heard as many engineers thinking about design. What are the AI tools for doing that?
  • 20:19: I hear the same thing. What I hope remains uncommoditized is taste. I’ve found that it’s very hard to teach taste to people. If I have a designer who is a good systems thinker but doesn’t have the gestalt of the visual design layer, I haven’t been able to teach that to them. But I have been able to find people with a clear sense of taste from diverse design backgrounds and get them on board with interaction design and systems thinking and applications.
  • 21:02: If you’re a young person and you’re skilled, you can go into either design or software engineering. Of course, now you’re reading articles saying “forget about software engineering.” I haven’t seen articles saying “forget about design.”
  • 21:31: I disagree with the idea that it’s a bad time to be an engineer. It’s never been more exciting.
  • 21:46: But you have to be open to that. If you’re a curmudgeon, you’re going to be in trouble.
  • 21:53: This happens with every technical platform transition. I spent so many years during the smartphone boom hearing people say, “No one is ever going to watch TV and movies on mobile.” Is it an affinity to the past, or a sense of doubt about the future? Every time, it’s been the same thing.
  • 22:37: One way to think of AgentDB is like a wedge. It addresses one clear pain point in the stack that people have to grapple with. So what’s next? Is it Kubernetes?
  • 23:09: I don’t want to go near that one! The broader context of how applications are changing—how do I create a coherent product that people understand how to use, that has aesthetics, that has a personality?—is a very wide-open question. There’s a bunch of other systems that have not been made for AI models. A simple example is search APIs. Search APIs are basically structured the same way as results pages. Here’s your 10 blue links. But an agentic model can suck up so much information. Not only should you be giving it the web page, you should be giving it the whole site. Those systems are not built for this world at all. You can go down the list of the things we use as core infrastructure and think about how they were made for a human, not the capabilities of an enormous large language model.
  • 24:39: Right now, I’m writing an article on enterprise search, and one of things people don’t realize is that it’s broken. In terms of AgentDB, do you worry about things like security, governance? There’s another place black hat attackers can go after.
  • 25:20: Absolutely. All new technologies have the light side and the dark side. It’s always been a codebreaker-codemaker stack. That doesn’t change. The attack vectors are different and, in the early stages, we don’t know what they are, so it is a cat and mouse game. There was an era when spam in email was terrible; your mailbox would be full of spam and you manually had to mark things as junk. Now you use gmail, and you don’t think about it. When was the last time you went into the junk mail tab? We built systems, we got smarter, and the average person doesn’t think about it.
  • 26:31: As you have more people building agents, and agents building agents, you have data governance, access control; suddenly you have AgentDB artifacts all over the place. 
  • 27:06: Two things here. This is an underappreciated part of this. Two years ago I launched my own personal chatbot that works off my writings. People ask me what model am I using, and how is it built? Those are partly interesting questions. But the real work in that system is constantly looking at the questions people are asking, and evaluating whether or not it responded well. I’m constantly course-correcting the system. That’s the work that a lot of people don’t do. But the thing I’m doing is applying taste, applying a perspective, defining what “good” is. For a lot of systems like enterprise search, it’s like, “We deployed the technology.” How do you know if it’s good or not? Is someone in there constantly tweaking and tuning? What makes Google Search so good? It’s constantly being re-evaluated. Or Google Translate—was this translation good or bad? Baked in early on.
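
To make the template idea from 7:42 concrete, here’s a hedged, generic sketch of shipping a schema plus a natural-language description alongside a database file so an agent can write a correct query on the first try. It illustrates the concept only; it is not AgentDB’s actual API, and every name, field, and file in it is hypothetical.

```python
# Hedged, generic illustration of the "template" idea: the schema and a
# natural-language description travel with the database so the agent can skip
# schema discovery (and the tokens it burns). Not AgentDB's actual API.
import json
import sqlite3

template = {
    "description": "Personal job-application tracker.",
    "schema": "CREATE TABLE IF NOT EXISTS applications "
              "(company TEXT, role TEXT, applied_on TEXT, status TEXT);",
    "example_queries": [
        "SELECT company, role FROM applications WHERE status = 'interview';",
    ],
}

# A unique ID is enough to "manifest" a database as a single, isolated file.
conn = sqlite3.connect("db-3f9a.sqlite")
conn.execute(template["schema"])
conn.execute("INSERT INTO applications VALUES ('Acme', 'Data Engineer', '2025-08-01', 'interview')")
conn.commit()

# The context an agent would receive along with the database: enough to write
# a correct query on the first try.
print(json.dumps(template, indent=2))
print(conn.execute(template["example_queries"][0]).fetchall())
```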

Generative AI in the Real World: Understanding A2A with Heiko Hotz and Sokratis Kartakis

Everyone is talking about agents: single agents and, increasingly, multi-agent systems. What kind of applications will we build with agents, and how will we build with them? How will agents communicate with each other effectively? Why do we need a protocol like A2A to specify how they communicate? Join Ben Lorica as he talks with Heiko Hotz and Sokratis Kartakis about A2A and our agentic future.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Timestamps

  • 0:00: Intro to Heiko and Sokratis.
  • 0:24: It feels like we’re in a Cambrian explosion of frameworks. Why agent-to-agent communication? Some people might think we should focus on single-agent tooling first.
  • 0:53: Many developers start developing agents with completely different frameworks. At some point they want to link the agents together. One way is to change the code of your application. But it would be easier if you could get the agents talking the same language. 
  • 1:43: Was A2A something developers approached you for?
  • 1:53: It is fair to say that A2A is a forward-looking protocol. We see a future where one team develops an agent that does something and another team in the same organization, or even outside, would like to leverage that capability. In the past, this was done via an API, but an agent is very different from an API. With agents, I need a stateful protocol where I send a task and the agent can run asynchronously in the background and do what it needs to do. That’s the justification for the A2A protocol. No one has explicitly asked for this, but we will be there in a few months’ time. 
  • 3:55: For developers in this space, the most familiar is MCP, which is a single agent protocol focused on external tool integration. What is the relationship between MCP and A2A?
  • 4:26: We believe that MCP and A2A will be complementary and not rivals. MCP is specific to tools, and A2A connects agents with each other. That brings us to the question of when to wrap a piece of functionality in a tool versus an agent. If we look at the technical implementation, that gives us some hints about when to use each. An MCP tool exposes its capability via a structured schema: I need inputs A and B, and I give you the sum. I can’t deviate from the schema. It’s also a single interaction. If I wrap the same functionality into an agent, the way I expose the functionality is different. A2A expects a natural language description of the agent’s functionality: “The agent adds two numbers.” Also, A2A is stateful: I send a task, and the agent reports back on it over time. That gives developers a hint on when to use an agent and when to use a tool. I like to use the analogy of a vending machine versus a concierge. I put money into a vending machine and push a button and get something out. I talk to a concierge and say, “I’m thirsty; buy me something to drink.” (See the sketch after these timestamps for how the two describe themselves.)
  • 7:09: Maybe we can help our listeners make the notion of A2A even more concrete. I tell nonexperts that you’re already using an agent to some extent. Deep research is an agent. I talk to people building AI tools in finance, and I have a notion that I want to research, but I have one agent looking at earnings, another looking at other data. Do you have a canonical example you use?
  • 8:13: We can draw a parallel between A2A and a real business. Imagine separate agents as different employees with different skills. They have their own business cards, and they share those business cards with clients. The client can understand what tasks they can do: learn about stocks, learn about investments. So I call the right agent or server to get a specialized answer back. Each agent has a business card that describes its skills and capabilities. I can talk to the agent with live streaming or send it messages. You need to define how you communicate with the agent. And you need to define the security method you will use to exchange messages.
  • 9:45: Late last year, people started talking about single agents. But people were already talking about what the agent stack would be: memory, storage, observability, and so on. Now that you are talking about multi-agents or A2A, are there important things that need to be introduced to the agentic stack?
  • 10:32: You would still have the same. You’d arguably need more. Statefulness, memory, access to tools.
  • 10:48: Is that going to be like a shared memory across agents?
  • 10:52: It all depends on the architecture. The way I imagine a vanilla architecture, the user speaks to a router agent, which is the primary contact of the user with the system. That router agent does very simple things like saying “hello.” But once the user asks the system “Book me a holiday to Paris,” there are many steps involved. (No agent can do this yet). The capabilities are getting better and better. But the way I imagine it is that the router agent is the boss, and two or three remote agents do different things. One finds flights; one books hotels; one books cars—they all need information from each other. The router agent would hold the context for all of those. If you build it all within one agentic framework, it becomes even easier because those frameworks have the concepts of shared memory built in. But it’s not necessarily needed. If the hotel booking agent is built in LangChain and from a different team than the flight booking agent, the router agent would decide what information is needed.
  • 13:28: What you just said is the argument for why you need these protocols. Your example is the canonical simple example. What if my trip involves four different countries? I might need a hotel agent for every country. Because hotels might need to be specialized for local knowledge.
  • 14:12: Technically, you might not need to change agents. You need to change the data—what agent has access to what data. 
  • 14:29: We can draw a parallel between single agents versus multi-agent systems and the move from a monolithic application to microservices: small, dedicated agents that perform specific tasks. This has many benefits. It also makes the life of the developer easier because you can test, you can evaluate, you can perform checks before moving to production. Imagine that you gave a human 100 tools to perform a task. The human will get confused. It’s the same for agents. You need small agents with specific tools to perform the right task. 
  • 15:31: Heiko’s example drives home why something like MCP may not be enough. If you have a master agent and all it does is integrate with external sites, but the integration is not smart—if the other side has an agent, that agent could be thinking as well. While agent-to-agent is something of a science fiction at the moment, it does make sense moving forward.
  • 16:11: Coming back to Sokratis’s thought, when you give an agent too many tools and make it try to do too many things, it just becomes more and more likely that by reasoning through these tools, it will pick the wrong tool. That gets us to evaluation and fault tolerance. 
  • 16:52: At some point we might see multi-agent systems communicate with other multi-agent systems—an agent mesh.
  • 17:05: In the scenario of this hotel booking, each of the smaller agents would use their own local model. They wouldn’t all rely on a central model. Almost all frameworks allow you to choose the right model for the right task. If a task is simple but still requires an LLM, a small open source model could be sufficient. If the task requires heavy “brain” power, you might want to use Gemini 2.5 Pro.
  • 18:07: Sokratis brought up the word security. One of the earlier attacks against MCP is a scenario when an attacker buries instructions in the system prompt of the MCP server or its metadata, which then gets sent into the model. In this case, you have smaller agents, but something may happen to the smaller agents. What attack scenarios worry you at this point?
  • 19:02: There are many levels at which something might go wrong. With a single agent, you have to implement guardrails before and after each call to an LLM or agent.
  • 19:24: In a single agent, there is one model. Now each agent is using its own model. 
  • 19:35: And this makes the evaluation and security guardrails even more problematic. From A2A’s side, it supports all the different security types to authenticate agents, like API keys, HTTP authentication, OAuth 2. Within the agent card, the agent can define what you need in order to use the agent. Then you need to think of this as a shared responsibility: It’s not just the responsibility of the protocol; it’s the responsibility of the developer.
  • 20:29: It’s equivalent to right now with MCP. There are thousands of MCP servers. How do I know which to trust? But at the same time, there are thousands of Python packages. I have to figure out which to trust. At some level, some vetting needs to be done before you trust another agent. Is that right?
  • 21:00: I would think so. There’s a great article: “The S in MCP Stands for Security.” We can’t speak as much to the MCP protocol, but I do believe there have been efforts to implement authentication methods and address security concerns, because this is the number one question enterprises will ask. Without proper authentication and security, you will not have adoption in enterprises, which means you will not have adoption at all. With A2A, these concerns were addressed head-on because the A2A team understood that to get any chance of traction, built-in security was priority 0. 
  • 22:25: Are you familiar with the buzzword “large action models”? The notion that your model is now multimodal and can look at screens and environment states.
  • 22:51: Within DeepMind, we have Project Mariner, which leverages Gemini’s capabilities to ask on your behalf about your computer screen.
  • 23:06: It makes sense that it’s something you want to avoid if you can. If you can do things in a headless way, why do you want to pretend you’re human? If there’s an API or integration, you would go for that. But the reality is that many tools knowledge workers use may not have these features yet. How does that impact how we build agent security? Now that people might start building agents to act like knowledge workers using screens?
  • 23:45: I spoke with a bank in the UK yesterday, and they were very clear that they need to have complete observability on agents, even if that means slowing down the process. Because of regulation, they need to be able to explain every request that went to the LLM, and every action that followed from that. I believe observability is the key in this setup, where you just cannot tolerate any errors. Because it is LLM-based, there will still be errors. But in a bank you must at least be in a position to explain exactly what happened.
  • 24:45: With most customers, whenever there’s an agentic solution, they need to share that they are using an agentic solution and the way [they] are using it is X, Y, and Z. A legal agreement is required to use the agent. The customer needs to be clear about this. There are other scenarios, like UI testing, where, as a developer, I want an agent to start using my machine. Or an elderly person who is connected with a telco’s customer support to fix a router, which is nearly impossible for a nontechnical person to achieve on their own. The fear is there, like nuclear energy, which can be used in two different ways. It’s the same with agents and GenAI. 
  • 26:08: A2A is a protocol. As a protocol, there’s only so much you can do on the security front. At some level, that’s the responsibility of the developers. I may want to signal that my agent is secure because I’ve hired a third party to do penetration testing. Is there a way for the protocol to embed knowledge about the extra step?
  • 27:00: A protocol can’t handle all the different cases. That’s why A2A created the notion of extensions. You can extend the data structure and also the methods or the profile. Within this profile, you can say, “I want all the agents to use this encryption.” And with that, you can tell all your systems to use the same patterns. You create the extension once, you adopt that for all the A2A compatible agents, and it’s ready. 
  • 27:51: For our listeners who haven’t opened the protocol, how easy is it? Is it like REST or RPC?
  • 28:05: I personally learned it within half a day. For someone who is familiar with RPC, with traditional internet protocols, A2A is very intuitive. You have a server; you have a client. All you need to learn is some specific concepts, like the agent card. (The agent card itself could be used to signal not only my capabilities but how I have been tested. You can even think of other metrics like uptime and success rate.) You need to understand the concept of a task. And then the remote agent will update on this task as defined—for example, every five minutes or [upon] completion of specific subtasks.
  • 29:52: A2A already supports JavaScript, TypeScript, Python, Java, and .NET. In ADK, the agent development kit, with one line of code we can define a new A2A agent.
  • 30:27: What is the current state of adoption?
  • 30:40: I should have looked at the PyPI download numbers.
  • 30:49: Are you aware of teams or companies starting to use A2A?
  • 30:55: I’ve worked with a customer with an insurance platform. I don’t know anything about insurance, but there’s the broker and the underwriter, which are usually two different companies. They were thinking about building an agent for each and having the agents talk via A2A.
  • 31:32: Sokratis, what about you?
  • 31:40: The interest is there for sure. Three weeks ago, I presented [at] the Google Cloud London Summit with a big customer on the integration of A2A into their agentic platform, and we shared tens of customers, including the announcement from Microsoft. Many customers start implementing agents. At some point they lack integration across business units. Now they see the more agents they build, the more the need for A2A.
  • 32:32: A2A is now in the Linux Foundation, which makes it more attractive for companies to explore, adopt, and contribute to, because it’s no longer controlled by a single entity. So decision making will be shared across multiple entities.
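
To make the vending-machine-versus-concierge contrast from 4:26 concrete, here’s a hedged, simplified illustration of how an MCP-style tool and an A2A-style agent card describe themselves. Field names are approximate and abridged rather than verbatim from either spec, and the endpoint URL is hypothetical.

```python
# Hedged, simplified illustration; field names are approximate, not a verbatim
# copy of the MCP or A2A specifications.

# MCP-style tool: a rigid structured schema. The caller supplies exactly these inputs.
add_numbers_tool = {
    "name": "add_numbers",
    "inputSchema": {
        "type": "object",
        "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
        "required": ["a", "b"],
    },
}

# A2A-style agent card: a natural-language "business card" describing skills,
# how to talk to the agent, and how to authenticate.
math_agent_card = {
    "name": "math-agent",
    "description": "An agent that adds two numbers described in natural language.",
    "url": "https://example.com/a2a",            # hypothetical endpoint
    "capabilities": {"streaming": True},
    "authentication": {"schemes": ["bearer"]},   # API keys, HTTP auth, OAuth 2 also possible
    "skills": [
        {"id": "add", "description": "Add two numbers and explain the result."},
    ],
}

# The vending machine takes exact coins and a button press; the concierge takes
# "I'm thirsty" and figures out the rest, possibly over several follow-up messages.
print(add_numbers_tool["inputSchema"]["required"], math_agent_card["skills"][0]["description"])
```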
