Generative AI in the Real World: Aurimas Griciūnas on AI Teams and Reliable AI Systems

15 January 2026 at 11:55

SwirlAI founder Aurimas Griciūnas helps tech professionals transition into AI roles and works with organizations to create AI strategy and develop AI systems. Aurimas joins Ben to discuss the changes he’s seen over the past couple years with the rise of generative AI and where we’re headed with agents. Aurimas and Ben dive into some of the differences between ML-focused workloads and those implemented by AI engineers—particularly around LLMOps and agentic workflows—and explore some of the concerns animating agent systems and multi-agent systems. Along the way, they share some advice for keeping your talent pipeline moving and your skills sharp. Here’s a tip: Don’t dismiss junior engineers.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform or follow us on YouTube, Spotify, Apple, or wherever you get your podcasts.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.44
All right. So today for our first episode of this podcast in 2026, we have Aurimas Griciūnas of SwirlAI. And he was previously at Neptune.ai. Welcome to the podcast, Aurimas. 

01.02
Hi, Ben, and thank you for having me on the podcast. 

01.07
So actually, I want to start with a little bit of culture before we get into some technical things. I noticed now it seems like you’re back to teaching people some of the latest ML and AI stuff. Of course, before the advent of generative AI, the terms we were using were ML engineer, MLOps. . . Now it seems like it’s AI engineer and maybe LLMOps. I’m assuming you use this terminology in your teaching and consulting as well.

So in your mind, Aurimas, what are some of the biggest distinctions between that move from ML engineer to AI engineer, from MLOps to LLMOps? What are two to three of the biggest things that people should understand?

02.05
That’s a great question, and the answer depends on how you define AI engineering. I think how most of the people today define it is a discipline that builds systems on top of already existing large language models, maybe some fine-tuning, maybe some tinkering with the models. But it’s not about the model training. It’s about building systems or systems on top of the models that you already have.

So the distinction is quite big because we are no longer creating models. We are reusing models that we already have. And hence the discipline itself becomes a lot more similar to software engineering than actual machine learning engineering. So we are not training models. We are building on top of the models. But some of the similarities remain because both of the systems that we used to build as machine learning engineers and now we build as AI engineers are nondeterministic in their nature.

So some evaluation and practices of how we would evaluate these systems remain. In general, I would even go as far as to say that there are more differences than similarities between these two disciplines, and it's really, really hard to properly single out three main ones. Right?

03.38
So I would say software engineering, right. . . 

03.42
So, I guess, based on your description there, the personas have changed as well.

So in the previous incarnation, you had ML teams, data science teams—they were mostly the ones responsible for doing a lot of the building of the models. Now, as you point out, at most, people are doing some sort of posttraining, like fine-tuning. Maybe the more advanced teams are doing some sort of RL, but that's really limited, right?

So the persona has changed. But on the other hand, at some level, Aurimas, it’s still a model, so then you still need the data scientist to interpret some of the metrics and the evals, correct? In other words, if you run with completely just “Here’s a bunch of software engineers; they’ll do everything,” obviously you can do that, but is that something you recommend without having any ML expertise in the team? 

04.51
Yes and no. A year ago or two years ago, maybe one and a half years ago, I would say that machine learning engineers were still the best fit for AI engineering roles because they were used to dealing with nondeterministic systems.

They knew how to evaluate something that the output of which is a probabilistic function. So it is more of a mindset of working with these systems and the practices that come from actually building machine learning systems beforehand. That’s very, very useful for dealing with these systems.

05.33
But nowadays, I think already many people—many specialists, many software engineers—have already tried to upskill in this nondeterminism and learn quite a lot [about] how you would evaluate these kinds of systems. And the most valuable specialist nowadays, [the one who] can actually, I would say, bring the most value to the companies building these kinds of systems is someone who can actually build end-to-end, and so has all kinds of skills, starting from being able to figure out what kind of products to build and actually implementing some POC of that product, shipping it, exposing it to the users and being able to react [to] the feedback [from] the evals that they built out for the system. 

06.30
But the eval part can be learned. Right. So you should spend some time on it. But I wouldn’t say that you need a dedicated data scientist or machine learning engineer specifically dealing with evals anymore. Two years ago, probably yes. 

06.48
So based on what you’re seeing, people are beginning to organize accordingly. In other words, the recognition here is that if you’re going to build some of these modern AI systems or agentic systems, it’s really not about the model. It’s a systems and software engineering problem. So therefore we need people who are of that mindset. 

But on the other hand, it is still data. It's still a data-oriented system, so you might still have pipelines, right? Data pipelines that data engineers typically maintain. . . And there's always been this lamentation, even before the rise of generative AI: "Hey, these data pipelines maintained by data engineers are great, but they don't have the same software engineering rigor that, you know, the people building web applications are used to." What's your sense in terms of the rigor that these teams are bringing to the table in terms of software engineering practices?

08.09
It depends on who is building the system. AI engineers [comprise an] extremely wide range. An engineer can be an AI engineer. A software engineer could be an AI engineer, and a machine learning engineer can be an AI engineer. . .

08.31 
Let me rephrase that, Aurimas. In your mind, [on] the best teams, what’s the typical staffing pattern? 

08.39
It depends on the size of the project. If it’s just a project that’s starting out, then I would say a full stack engineer can quickly actually start off a project, build A, B, or C, and continue expanding it. And then. . .

08.59
Mainly relying on some sort of API endpoint for the model?

09.04
Not necessarily. So it can be a REST API-based system. It can be a stream processing-based system. It can be just a CLI script. I would never encourage [anyone] to build a system which is more complex than it needs to be, because very often when you have an idea, just to prove that it works, it's enough to build out, you know, an Excel spreadsheet with a column of inputs and outputs and then just give the outputs to the stakeholder and see if it's useful.

So it's not always needed to start with a REST API. But in general, when it comes to who should start it off, I think it's people who are very generalist. Because at the very beginning, you need to understand end to end—from product to software engineering to maintaining those systems.

10.01
But once this system evolves in complexity, then very likely the next person you would be bringing on—again, depending on the product—very likely would be someone who is good at data engineering. Because as you mentioned before, most of the systems are relying on a very high, very strong integration of these already existing data systems [that] you’re building for an enterprise, for example. And that’s a hard thing to do right. And the data engineers do it quite [well]. So definitely a very useful person to have in the team. 

10.43
And maybe eventually, once those evals come into play, depending on the complexity of the product, the team might benefit from having an ML engineer or data scientist in between. But this is more targeted at those cases where the product is complex enough that you actually need LLMs as judges, and then you need to evaluate those LLM judges so that your evals are evaluated as well.

If you just need some simple evals—because some of them can be exact assertion-based evals—those can easily be done, I think, by someone who doesn’t have past machine learning experience.
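
To make "exact assertion-based evals" concrete, here is a minimal sketch in Python. The function, field names, and thresholds are illustrative assumptions rather than part of any particular framework; the point is that these checks are deterministic and need no ML background, unlike LLM-as-judge evals.

```python
# A hypothetical assertion-based eval: deterministic checks on a single model response.
# All names and thresholds here are made up for illustration.

def eval_support_answer(output: str, expected_order_id: str) -> dict:
    checks = {
        # The answer must reference the order it was asked about.
        "mentions_order_id": expected_order_id in output,
        # Keep responses short enough for the chat widget.
        "under_length_limit": len(output) <= 1200,
        # Guard against leaking an internal tool name into user-facing text.
        "no_internal_tool_name": "internal_refund_api" not in output.lower(),
    }
    return {"passed": all(checks.values()), "checks": checks}

sample = "Your order 12345 qualifies for a refund within 30 days."
print(eval_support_answer(sample, expected_order_id="12345"))
```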

11.36
Another cultural question I have is the following. I would say two years ago, 18 months ago, most of these AI projects were conducted. . . Basically, it was a little more decentralized, in other words. So here’s a group here. They’re going to do something. They’re going to build something on their own and then maybe try to deploy that. 

But now recently I’m hearing, Aurimas, and I don’t know if you are hearing the same thing, that, at least in some of these big companies, they’re starting to have much more of a centralized team that can help other teams.

So in other words, there’s a centralized team that somehow has the right experience and has built a few of these things. And then now they can kind of consolidate all those learnings and then help other teams. If I’m in one of these organizations, then I approach these experts. . . I guess in the old, old days—I hate this term—they would use some center of excellence kind of thing. So you will get some sort of playbook and they will help you get going. Sort of like in your previous incarnation at Neptune.ai. . . It’s almost like you had this centralized tool and experiment tracker where someone can go in and learn what others are doing and then learn from each other.

Is this something that you’re hearing that people are going for more of this kind of centralized approach? 

13.31
I do hear about these kinds of situations, but naturally, it’s always a big enterprise that’s managed to pull that off. And I believe that’s the right approach because that’s also what we have been doing before GenAI. We had those centers of excellence. . . 

13.52
I guess for our audience, explain why you think this is the right approach. 

13.58
So, two reasons why I think it is the right approach. The first is that we used to have these platform teams that would build out a shared pool of software that could be reused by other teams. So we kind of defined the standards of how these systems should be operated, in production and in development. And they would decide what kind of technologies and tech stack should be used within the company. So I think it's a good idea to not spread too widely in the tools that you're using.

Also, have template repositories that you can just pull and reuse. Because then not only is it easier to kick off and start your buildout of the project, but it also helps control how well this knowledge can actually be centralized, because. . .

14.59
And also there’s security, then there’s governance as well. . . 

15.03
For example, yes. The platform side is one of those—just use the same stack and help others build it easier and faster. And the second piece is that obviously GenAI systems are still very young. So [it’s] very early and we really do not have, as some would say, enough reps in building these kinds of systems.

So we learn as we go. With regular machine learning, we already had everything figured out. We just needed some practice. Now, if we learn in this distributed way and then we do not centralize learnings, we suffer. So basically, that’s why you would have a central team that holds the knowledge. But then it should, you know, help other teams implement some new type of system and then bring those learnings back into the central core and then spread those learnings back to other teams.

But this is also how we used to operate in these platform teams in the old days, three years, four years ago. 

16.12
Right, right, right. But then, I guess, what happened with the release of generative AI is that the platform teams might have moved too slowly for the rank and file. And so hence you started hearing about what they call shadow AI, where people would use tools that were not exactly blessed by the platform team. But now I think the platform teams are starting to arrest some of that.

16.42
I wonder if it is platform teams who are kind of catching up, or is it the tools that [are] maturing and the practices that are maturing? I think we are getting more and more reps in building those systems, and now it’s easier to catch up with everything that’s going on. I would even go as far as to say it was impossible to be on top of it, and maybe it wouldn’t even make sense to have a central team.

17.10
A lot of these demos look impressive—generative AI demos, agents—but they fail when you deploy them in the wild. So in your mind, what is the single biggest hurdle or the most common reason why a lot of these demos or POCs fall short or become unreliable in production? 

17.39
That again, depends on where we are deploying the system. But one of the main reasons is that it is very easy to build a POC, and then it targets a very specific and narrow set of real-world scenarios. And we kind of believe that it solves [more than it does]. It just doesn’t generalize well to other types of scenarios. And that’s the biggest problem.

18.07
Of course there are security issues and all kinds of stability issues, even with the biggest labs and the biggest providers of LLMs, because those APIs are also not always stable, and you need to take care of that. But that’s an operational issue. I think the biggest issue is not operational. It’s actually evaluation-based, and sometimes even use case-based: Maybe the use case is not the correct one. 

18.36
You know, before the advent of generative AI, ML teams and data teams were just starting to get going on observability. And then obviously generative AI comes into the picture. So what changes as far as LLMs and generative AI when it comes to observability?

19.00
I wouldn’t even call observability of regular machine learning systems and [of] AI systems the same thing.

Going back to a previous parallel, generative AI observability is a lot more similar to regular software observability. It's all about tracing your application; then, on top of those traces that you collect in the same way you would collect them from a regular software application, you add some additional metadata so that it is useful for performing evaluation actions on your agentic AI type of system.

So I would even contrast machine learning observability with GenAI observability because I think these are two separate things.

19.56
Especially when it comes to agents and the agents that involve some sort of tool use, then you’re really getting into kind of software traces and software observability at that point. 

20.13
Exactly. Tool use is just a function call. A function call is just a regular software span, let's say. Now what's important for GenAI is that you also know why that tool was selected to be used. And that's where you trace outputs of your LLMs. And you know why that LLM call, that generation, has decided to use this and not the other tool.

So things like prompts, token counts, and how much time to first token it took for each generation, these kinds of things are what's additional to be traced compared to regular software tracing.
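
As a rough illustration of what that extra metadata might look like, the sketch below wraps a streaming LLM call in an OpenTelemetry span and attaches the prompt, time to first token, and output size on top of an ordinary software trace. The attribute keys and the stream_fn placeholder are assumptions for illustration, not a fixed convention.

```python
# A hedged sketch: a GenAI call is still just a span; the GenAI-specific part is
# the metadata attached to it. Requires the opentelemetry-api package.
import time
from opentelemetry import trace

tracer = trace.get_tracer("genai-tracing-sketch")

def traced_generation(prompt: str, stream_fn) -> str:
    """Wrap a streaming LLM call in a span; stream_fn stands in for the real client."""
    with tracer.start_as_current_span("llm.generation") as span:
        span.set_attribute("llm.prompt", prompt)
        start = time.monotonic()
        chunks, ttft_ms = [], None
        for chunk in stream_fn(prompt):
            if ttft_ms is None:
                # Time to first token, one of the GenAI-specific signals mentioned above.
                ttft_ms = (time.monotonic() - start) * 1000
            chunks.append(chunk)
        completion = "".join(chunks)
        span.set_attribute("llm.time_to_first_token_ms", ttft_ms or 0.0)
        span.set_attribute("llm.completion_chars", len(completion))
        return completion
```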

20.58
And then, obviously, there’s also. . . I guess one of the main changes probably this year will be multimodality, if there’s different types of modes and data involved.

21.17
Right. For some reason I didn’t touch upon that, but you’re right. There’s a lot of difference here because inputs and outputs, it’s hard. First of all, it’s hard to trace these kinds of things like, let’s say, audio input and output [or] video images. But I think [an] even harder kind of problem with this is how do you make sure that the data that you trace is useful?

Because those observability systems that are being built out, like LangSmith, Langfuse, and all the others. . . You know, how do you make it so that it's convenient to actually look at the data that you trace, which is not text and not regular software spans? How [do] you build, [or] even correlate, two different audio inputs to each other? How do you do that? I don't think that problem is solved yet. And I don't even think that we know what we want to see when it comes to comparing this kind of data next to each other.

22.30
So let’s talk about agents. A friend of mine actually asked me yesterday, “So, Ben, are agents real, especially on the consumer side?” And my friend was saying he doesn’t think it’s real. So I said, actually, it’s more real than people think in the following sense: First of all, deep research, that’s agents. 

And then secondly, people might be using applications that involve agents, but they don't know it. So, for example, they're interacting with the system, and that system involves some sort of data pipeline that was written and is being monitored and maintained by an agent. Sure, the actual application is not an agent. But underneath, there are agents involved in the application.

So to that extent, I think agents are definitely real in the data engineering and software engineering space. But I think there might be more consumer apps with agents involved underneath that consumers don't know about. What's your sense?

23.41
Quite similar. I don’t think there are real, full-fledged agents that are exposed. 

23.44
I think when people think of agents, they think of it as interacting with the agent directly. And that may not be the case yet.

24.04
Right. So then, it depends on how you define the agent. Is it a fully autonomous agent? What is an agent to you? So, GenAI in general is very useful on many occasions. It doesn't necessarily need to be a tool-using, fully autonomous agent.

24.21
So like I said, the canonical example for consumers would be deep research. Those are agents.

24.27
Those are agents, that’s for sure. 

24.30
If you think of that example, it’s a bunch of agents searching across different data collections, and then maybe a central agent unifying and presenting it to the user in a coherent way.

So from that perspective, there probably are agents powering consumer apps. But they may not be the actual interface of the consumer app. So the actual interface might still be rule-based or something. 

25.07
True. Like data processing. Some automation is happening in the background. And a deep research agent, that is exposed to the user. Now that’s relatively easy to build because you don’t need to very strongly evaluate this kind of system. Because you expect the user to eventually evaluate the results. 

25.39
Or in the case of Google, you can present both: They have the AI summary, and then they still have the search results. And then based on the user signals of what the user is actually consuming, then they can continue to improve their deep research agent. 

25.59
So let’s say the disasters that can happen from wrong results were not that bad. Right? So. 

26.06
Oh, no, it can be bad if you deploy it inside the enterprise, and you’re using it to prepare your CFO for some earnings call, right?

26.17
True, true. But then you know whose responsibility is it? The agent’s, that provided 100%…? 

26.24
You can argue that’s still an agent, but then the finance team will take those results and scrutinize [them] and make sure they’re correct. But an agent prepared the initial version. 

26.39
Exactly, exactly. So it still needs review.

26.42
Yeah. So the reason I bring up agents is, do agents change anything from your perspective in terms of eval, observability, and anything else? 

26.55
They do a little bit. Compared to agentic workflows that are not full agents, the only change that really happens. . . And we are talking now about multi-agent systems, where multiple agents can be chained or looped in together. So really the only difference there is that the length of the trace is not deterministic. And the number of spans is not deterministic. So in the sense of observability itself, the difference is minimal as long as those agents and multi-agent systems are running in a single runtime.

27.44
Now, when it comes to evals and evaluation, it is different because you evaluate different aspects of the system. You try to discover different patterns of failures. As an example, if you’re just running your agent workflow, then you know what kind of steps can be taken, and then you can be almost 100% sure that the entire path from your initial intent to the final answer is completed. 

Now with agent systems and multi-agent systems, you can still achieve, let’s say, input-output. But then what happens in the middle is not a black box, but it is very nondeterministic. Your agents can start looping the same questions between each other. So you need to also look for failure signals that are not present in agentic workflows, like too many back-and-forth [responses] between the agents, which wouldn’t happen in a regular agentic workflow.

Also, for tool use and planning, you need to figure out if the tools are being executed in the correct order. And similar things. 
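
A toy sketch of those failure signals: given an ordered list of trace events, flag excessive back-and-forth between agents and tools firing out of their expected order. The event schema, field names, and the handoff limit are invented for illustration and are not tied to any specific tracing tool.

```python
# Illustrative checks over a multi-agent trace: too many agent handoffs, wrong tool order.

def check_multi_agent_trace(events, expected_tool_order, max_handoffs=6):
    handoffs, prev_agent, tools_seen = 0, None, []
    for event in events:
        if prev_agent is not None and event["agent"] != prev_agent:
            handoffs += 1  # control passed from one agent to another
        prev_agent = event["agent"]
        if event.get("tool"):
            tools_seen.append(event["tool"])

    known = [t for t in tools_seen if t in expected_tool_order]
    tool_order_ok = known == sorted(known, key=expected_tool_order.index)
    return {
        "too_many_handoffs": handoffs > max_handoffs,
        "tool_order_ok": tool_order_ok,
        "handoffs": handoffs,
    }

events_example = [
    {"agent": "planner", "tool": None},
    {"agent": "researcher", "tool": "search"},
    {"agent": "planner", "tool": None},
    {"agent": "writer", "tool": "draft"},
]
print(check_multi_agent_trace(events_example, expected_tool_order=["search", "draft"]))
```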

29.09
And that’s why I think in that scenario, you definitely need to collect fine-grained traces, because there’s also the communication between the agents. One agent might be lying to another agent about the status of completion and so on and so forth. So you need to really kind of have granular level traces at that point. Right? 

29.37
I would even say that you always need to trace the lower-level pieces. Even if you're running a simple RAG (retrieval-augmented generation) system, you still need those granular traces for each of the actions.

29.52
But definitely, interagent communication introduces more points of failure that you really need to make sure that you also capture. 

So in closing, I guess, this is a fast-moving field, right? So there’s the challenge for you, the individual, for your professional development. But then there’s also the challenge for you as an AI team in how you keep up. So any tips at both the individual level and at the team level, besides going to SwirlAI and taking courses? [laughs] What other practical tips would you give an individual in the team? 

30.47
So for individuals, for sure, learn fundamentals. Don’t rely on frameworks alone. Understand how everything is really working under the hood; understand how those systems are actually connected.

Just think about how those prompts and context [are] actually glued together and passed from an agent to an agent. Do not think that you will be able to just mount a framework right on top of your system, write [a] few prompts, and everything will magically work. You need to understand how the system works from the first principles.

So yeah. Go deep. That’s for individual practitioners. 

31.32
When it comes to teams, well, that’s a very good question and a very hard question. Because, you know, in the upcoming one or two years, everything can change so much. 

31.44
And then one of the challenges, Aurimas, for example, in the data engineering space. . . It used to be, several years ago, I have a new data engineer in the team. I have them build some basic pipelines. Then they get confident, [and] then they build more complex pipelines and so on and so forth. And then that’s how you get them up to speed and get them more experience.

But the challenge now is a lot of those basic pipelines can be built with agents, and so there's some amount of entry-level work that used to be the place where you could train your entry-level people. That work is disappearing, which also impacts your talent pipeline. If you don't have people at the beginning, then you won't have experienced people later on.

So any tips for teams and the challenge of the pipeline for talent?

32.56
That’s such a hard question. I would like to say, do not dismiss junior engineers. Train them. . .

33.09
Oh, yeah, I agree completely. I agree completely.

33.14
But that’s a hard decision to make, right? Because you need to be thinking about the future.

33.26
I think, Aurimas, the mindset people have to [have is to] say, okay, so the traditional training grounds we had, in this example of the data engineer, were these basic pipelines. Those are gone. Well, then we find a different way for them to enter. It might be they start managing some agents instead of building pipelines from scratch. 

33.56
We’ll see. We’ll see. But we don’t know. 

33.58
Yeah. Yeah. We don't know. The agents, even in the data engineering space, are still human-in-the-loop. So in other words, a human still needs to monitor [them] and make sure they're working. So that could be the entry point for junior data engineers. Right?

34.13
Right. But you know, that's the hard part about this question. The answer is, that could be, but we do not know, and for now maybe it doesn't make sense. . .

34.28
My point is that if you stop hiring these juniors, I think that's going to hurt you down the road. So you still hire the junior, and then you put them on a different track, and then, as you say, things might change, but then they can adapt. If you hire the right people, they will be able to adapt.

34.50
I agree, I agree, but then, there are also people who are potentially not right for that role, let’s say, and you know, what I. . . 

35.00
But that’s true even when you hired them and you assigned them to build pipelines. So same thing, right? 

35.08
The same thing. But the thing I see with the juniors and less senior people who are currently building is that we are relying too much on vibe coding. I would also suggest looking for ways to onboard someone new and make sure that the person actually learns the craft and doesn't just come in and vibe code his or her way around, creating more issues for senior engineers than actually helping.

35.50
Yeah, this is a big topic, but one of the challenges, all I can say is that, you know, the AI tools are getting better at coding at some level because the people building these models are using reinforcement learning and the signal in reinforcement learning is “Does the code run?” So then what people are ending up with now with this newer generation of these models is [that] they vibe code and they will get code that runs because that’s what the reinforcement learning is optimizing for.

But that doesn't mean that that code doesn't introduce problems, right? On the face of it, it's running, right? An experienced person obviously can probably handle that.

But anyway, so last word, you get the last word, but take us on a positive note. 

36.53
[laughs] I do believe that the future is bright. It’s not grim, not dark. I am very excited about what is happening in the AI space. I do believe that it will not be as fast. . . All this AGI and AI taking over human jobs, it will not happen as fast as everyone is saying. So you shouldn’t be worried about that, especially when it comes to enterprises. 

I believe that we already had [very powerful] technology one or one and a half years ago. [But] for enterprises to even utilize that kind of technology, which we already had one and a half years ago, will still take another five years or so to fully actually get the most out of it. So there will be enough work and jobs for at least the upcoming 10 years. And I think, people should not be worried too much about it.

38.06
But in general, eventually, even the ones who will lose their jobs will probably respecialize in that long period of time to some more valuable role. 

38.18
I guess I will close with the following advice: The main thing that you can do is just keep using these tools and keep learning. I think the distinction will be increasingly between those who know how to use these tools well and those who do not.

And with that, thank you, Aurimas.

Generative AI in the Real World: The LLMOps Shift with Abi Aryan

20 November 2025 at 07:16

MLOps is dead. Well, not really, but for many the job is evolving into LLMOps. In this episode, Abide AI founder and LLMOps author Abi Aryan joins Ben to discuss what LLMOps is and why it’s needed, particularly for agentic AI systems. Listen in to hear why LLMOps requires a new way of thinking about observability, why we should spend more time understanding human workflows before mimicking them with agents, how to do FinOps in the age of generative AI, and more.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: All right, so today we have Abi Aryan. She is the author of the O’Reilly book on LLMOps as well as the founder of Abide AI. So, Abi, welcome to the podcast. 

00.19: Thank you so much, Ben. 

00.21: All right. Let’s start with the book, which I confess, I just cracked open: LLMOps. People probably listening to this have heard of MLOps. So at a high level, the models have changed: They’re bigger, they’re generative, and so on and so forth. So since you’ve written this book, have you seen a wider acceptance of the need for LLMOps? 

00.51: I think more recently there are more infrastructure companies. So there was a conference happening recently, and there was this sort of perception or messaging across the conference, which was “MLOps is dead.” Although I don’t agree with that. 

There’s a big difference that companies have started to pick up on more recently, as the infrastructure around the space has sort of started to improve. They’re starting to realize how different the pipelines were that people managed and grew, especially for the older companies like Snorkel that were in this space for years and years before large language models came in. The way they were handling data pipelines—and even the observability platforms that we’re seeing today—have changed tremendously.

01.40: What about, Abi, the general. . .? We don’t have to go into specific tools, but we can if you want. But, you know, if you look at the old MLOps person and then fast-forward, this person is now an LLMOps person. So on a day-to-day basis [has] their suite of tools changed? 

02.01: Massively. I think for an MLOps person, the focus was very much around “This is my model. How do I containerize my model, and how do I put it in production?” That was the entire problem and, you know, most of the work was around “Can I containerize it? What are the best practices around how I arrange my repository? Are we using templates?” 

Drawbacks happened, but not as much, because most of the time the stuff was tested and there was not too much nondeterministic behavior within the models themselves. Now that has changed.

[For] most of the LLMOps engineers, the biggest job right now is doing FinOps really, which is controlling the cost, because the models are massive. The second thing, which has been a big difference, is we have shifted from "How can we build systems?" to "How can we build systems that can perform, and not just perform technically but perform behaviorally as well?": "What is the cost of the model? But also, what is the latency? What's the throughput looking like? How are we managing the memory across different tasks?"

The problem has really shifted when we talk about it. . . So a lot of the focus for MLOps was "Let's create fantastic dashboards that can do everything." Right now, no matter which dashboard you create, the monitoring is really very dynamic.

03.32: Yeah, yeah. As you were talking there, you know, I started thinking, yeah, of course, obviously now the inference is essentially a distributed computing problem, right? So that was not the case before. Now you have different phases even of the computation during inference, so you have the prefill phase and the decode phase. And then you might need different setups for those. 

So anecdotally, Abi, did the people who were MLOps people successfully migrate themselves? Were they able to upskill themselves to become LLMOps engineers?

04.14: I know a couple of friends who were MLOps engineers. They were teaching MLOps as well—Databricks folks, MVPs. And they were now transitioning to LLMOps.

But the way they started is they started focusing very much on "Can you do evals for these models?" They weren't really dealing with the infrastructure side of it yet. And that was their slow transition. And right now they're very much at that point where they're thinking, "OK, can we make it easy to just catch these problems within the model inferencing itself?"

04.49: A lot of other problems still stay unsolved. Then the other side, which was like a lot of software engineers who entered the field and became AI engineers, they have a much easier transition because software. . . The way I look at large language models is not just as another machine learning model but literally like software 3.0 in that way, which is it’s an end-to-end system that will run independently.

Now, the model isn't just something you plug in. The model is the product, really. So for those people, most software is built around these ideas, which is, you know, we need strong cohesion. We need low coupling. We need to think about "How are we doing microservices? How does the communication happen between the different tools that we're using? How are we calling our endpoints? How are we securing our endpoints?"

Those questions come easier. So the system design side of things comes easier to people who work in traditional software engineering. So the transition has been a little bit easier for them as compared to people who were traditionally like MLOps engineers. 

05.59: And hopefully your book will help some of these MLOps people upskill themselves into this new world.

Let’s pivot quickly to agents. Obviously it’s a buzzword. Just like anything in the space, it means different things to different teams. So how do you distinguish agentic systems yourself?

06.24: There are two words in the space. One is agents; one is agent workflows. Basically agents are the components really. Or you can call them the model itself, but they’re trying to figure out what you meant, even if you forgot to tell them. That’s the core work of an agent. And the work of a workflow or the workflow of an agentic system, if you want to call it, is to tell these agents what to actually do. So one is responsible for execution; the other is responsible for the planning side of things. 

07.02: I think sometimes when tech journalists write about these things, the general public gets the notion that there's this monolithic model that does everything. But the reality is, most teams are moving away from that design, as you describe.

So they have an agent that acts as an orchestrator or planner and then parcels out the different steps or tasks needed, and then maybe reassembles in the end, right?

07.42: Coming back to your point, it’s now less of a problem of machine learning. It’s, again, more like a distributed systems problem because we have multiple agents. Some of these agents will have more load—they will be the frontend agents, which are communicating to a lot of people. Obviously, on the GPUs, these need more distribution.

08.02: And when it comes to the other agents that may not be used as much, they can be provisioned based on “This is the need, and this is the availability that we have.” So all of that provisioning again is a problem. The communication is a problem. Setting up tests across different tasks itself within an entire workflow, now that becomes a problem, which is where a lot of people are trying to implement context engineering. But it’s a very complicated problem to solve. 

08.31: And then, Abi, there’s also the problem of compounding reliability. Let’s say, for example, you have an agentic workflow where one agent passes off to another agent and yet to another third agent. Each agent may have a certain amount of reliability, but it compounds over time. So it compounds across this pipeline, which makes it more challenging. 

09.02: And that's where there's a lot of research work going on in the space. It's an idea that I talked about in the book as well, especially chapter four, in which a lot of these [architectures] are described. Most of the companies right now are [using a] monolithic architecture, but it's not going to be able to sustain itself as we go toward applications.

We have to go towards a microservices architecture. And the moment we go towards microservices architecture, there are a lot of problems. One will be the hardware problem. The other is consensus building, which is. . . 

Let's say you have three different agents spread across three different nodes, which would be running very differently. Let's say one is running on an H100; one is running on something else. How can we achieve consensus on which one of the nodes ends up winning? So that's open research work [where] people are trying to figure out, "Can we achieve consensus among agents based on whatever answer the majority is giving, or how do we really think about it?" Should it be set up with a threshold at which, if it's beyond this threshold, then, you know, this perfectly works?

One of the frameworks that is trying to work in this space is called MassGen—they're working on the research side of solving this problem in the tool itself.
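
As a toy illustration of the threshold idea, the sketch below tallies answers from several agents and accepts the majority answer only if its vote share clears a cutoff. It is not how MassGen or any specific framework works; the normalization and the two-thirds threshold are arbitrary choices for the example.

```python
# Hypothetical threshold-based consensus across agent answers.
from collections import Counter

def agent_consensus(answers, threshold=0.67):
    """Return the majority answer only if its share of votes clears the threshold."""
    normalized = [a.strip().lower() for a in answers]
    top_answer, votes = Counter(normalized).most_common(1)[0]
    share = votes / len(normalized)
    return (top_answer, share) if share >= threshold else (None, share)

# Three of four agents agree, so the answer is accepted at a two-thirds threshold.
print(agent_consensus(["Paris", "paris", "Paris", "Lyon"]))  # ('paris', 0.75)
```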

10.31: By the way, even back in the microservices days in software architecture, obviously people went overboard too. So I think that, as with any of these new things, there’s a bit of trial and error that you have to go through. And the better you can test your systems and have a setup where you can reproduce and try different things, the better off you are, because many times your first stab at designing your system may not be the right one. Right? 

11.08: Yeah. And I’ll give you two examples of this. So AI companies tried to use a lot of agentic frameworks. You know people have used Crew; people have used n8n, they’ve used. . . 

11.25: Oh, I hate those! Not I hate. . . Sorry. Sorry, my friends and crew. 

11.30: And 90% of the people working in this space seriously have already made that transition, which is "We are going to write it ourselves."

The same happened for evaluation: There were a lot of evaluation tools out there. What they were doing on the surface is literally just tracing, and tracing wasn’t really solving the problem—it was just a beautiful dashboard that doesn’t really serve much purpose. Maybe for the business teams. But at least for the ML engineers who are supposed to debug these problems and, you know, optimize these systems, essentially, it was not giving much other than “What is the error response that we’re getting to everything?”

12.08: So again, for that one as well, most of the companies have developed their own evaluation frameworks in-house, as of now. The people who are just starting out, obviously they’ve done. But most of the companies that started working with large language models in 2023, they’ve tried every tool out there in 2023, 2024. And right now more and more people are staying away from the frameworks and launching and everything.

People have understood that most of the frameworks in this space are not superreliable.

12.41: And [are] also, honestly, a bit bloated. They come with too many things that you don’t need in many ways. . .

12.54: Security loopholes as well. So for example, like I reported one of the security loopholes with LangChain as well, with LangSmith back in 2024. So those things obviously get reported by people [and] get worked on, but the companies aren't really proactively working on closing those security loopholes.

13.15: Two open source projects that I like that are not specifically agentic are DSPy and BAML. Wanted to give them a shout out. So this point I’m about to make, there’s no easy, clear-cut answer. But one thing I noticed, Abi, is that people will do the following, right? I’m going to take something we do, and I’m going to build agents to do the same thing. But the way we do things is I have a—I’m just making this up—I have a project manager and then I have a designer, I have role B, role C, and then there’s certain emails being exchanged.

So then the first step is “Let’s replicate not just the roles but kind of the exchange and communication.” And sometimes that actually increases the complexity of the design of your system because maybe you don’t need to do it the way the humans do it. Right? Maybe if you go to automation and agents, you don’t have to over-anthropomorphize your workflow. Right. So what do you think about this observation? 

14.31: A very interesting analogy I’ll give you is people are trying to replicate intelligence without understanding what intelligence is. The same for consciousness. Everybody wants to replicate and create consciousness without understanding consciousness. So the same is happening with this as well, which is we are trying to replicate a human workflow without really understanding how humans work.

14.55: And sometimes humans may not be the most efficient thing. Like they exchange five emails to arrive at something. 

15.04: And humans are never context-defined in a very limiting sense. Even if somebody's job is to do editing, they're not just doing editing. They are looking at the flow. They are looking for a lot of things which you can't really define. Obviously you can over a period of time, but it needs a lot of observation to understand. And that skill also depends on who the person is. Different people have different skills as well. Most of the agentic systems right now, they're just glorified Zapier IFTTT routines. That's the way I look at them right now. The if recipes: If this, then that.

15.48: Yeah, yeah. Robotic process automation I guess is what people call it. The other thing that I don't think people understand just from reading the popular tech press is that agents have levels of autonomy, right? Most teams don't actually build an agent and unleash it fully autonomous from day one.

I mean, I guess the analogy would be in self-driving cars: They have different levels of automation. Most enterprise AI teams realize that with agents, you have to kind of treat them that way too, depending on the complexity and the importance of the workflow. 

So at first a human is very much involved, and then there's less and less human involvement over time as you develop confidence in the agent.

But I think it’s not good practice to just kind of let an agent run wild. Especially right now. 

16.56: It’s not, because who’s the person answering if the agent goes wrong? And that’s a question that has come up often. So this is the work that we’re doing at Abide really, which is trying to create a decision layer on top of the knowledge retrieval layer.

17.07: Most of the agents which are built using just large language models. . . LLMs—I think people need to understand this part—are fantastic at knowledge retrieval, but they do not know how to make decisions. If you think agents are independent decision makers and they can figure things out, no, they cannot figure things out. They can look at the database and try to do something.

Now, what they do may or may not be what you like, no matter how many rules you define across that. So what we really need to develop is some sort of symbolic language around how these agents are working, which is more like trying to give them a model of the world: "What is the cause and effect of all of these decisions that you're making? How do we prioritize one decision over the. . .? What was the reasoning behind that?" That entire decision-making reasoning has been the missing part.

18.02: You brought up the topic of observability. There are two schools of thought here as far as agentic observability. The first one is we don't need new tools. We have the tools. We just have to apply [them] to agents. And then the second, of course, is this is a new situation, so now we need to be able to do more. . . The observability tools have to be more capable because we're dealing with nondeterministic systems.

And so maybe we need to capture more information along the way. Chains of decision, reasoning, traceability, and so on and so forth. Where do you fall in this kind of spectrum of we don’t need new tools or we need new tools? 

18.48: We don’t need new tools, but we certainly need new frameworks, and especially a new way of thinking. Observability in the MLOps world—fantastic; it was just about tools. Now, people have to stop thinking about observability as just visibility into the system and start thinking of it as an anomaly detection problem. And that was something I’d written in the book as well. Now it’s no longer about “Can I see what my token length is?” No, that’s not enough. You have to look for anomalies at every single part of the layer across a lot of metrics. 

19.24: So your position is we can use the existing tools. We may have to log more things. 

19.33: We may have to log more things, and then start building simple ML models to be able to do anomaly detection. 

Think of managing any machine, any LLM model, any agent as really like a fraud detection pipeline. So every single time you’re looking for “What are the simplest signs of fraud?” And that can happen across various factors. But we need more logging. And again you don’t need external tools for that. You can set up your own loggers as well.

Most of the people I know have been setting up their own loggers within their companies. So you can simply use telemetry to a) use the general logs and b) define your own custom logs as well, depending on your agent pipeline itself. You can define "This is what it's trying to do," log more things across those areas, and then start building small machine learning models to look for what's going on over there.
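
A hedged sketch of that "fraud detection" framing: log a few numeric signals per LLM call, then fit a small anomaly detector over the history. The logged fields, the simulated history, and the choice of scikit-learn's IsolationForest are illustrative assumptions, not a prescribed setup.

```python
# Illustrative only: per-call signals (prompt tokens, output tokens, latency, cost)
# plus a small IsolationForest to flag outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in for history pulled from your own telemetry/logs.
rng = np.random.default_rng(0)
history = np.column_stack([
    rng.normal(800, 100, 500),      # prompt tokens
    rng.normal(200, 40, 500),       # output tokens
    rng.normal(900, 150, 500),      # latency in ms
    rng.normal(0.004, 0.001, 500),  # cost in USD
])

detector = IsolationForest(contamination=0.01, random_state=0).fit(history)

# A suspicious call: far more output, latency, and cost than the history suggests.
suspect = np.array([[800, 4000, 12000.0, 0.08]])
print("anomaly" if detector.predict(suspect)[0] == -1 else "normal")
```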

20.36: So what is the state of “Where we are? How many teams are doing this?” 

20.42: Very few. Very, very few. Maybe just the top bits. The ones who are doing reinforcement learning training and using RL environments, because that’s where they’re getting their data to do RL. But people who are not using RL to be able to retrain their model, they’re not really doing much of this part; they’re still depending very much on external accounts.

21.12: I’ll get back to RL in a second. But one topic you raised when you pointed out the transition from MLOps to LLMOps was the importance of FinOps, which is, for our listeners, basically managing your cloud computing costs—or in this case, increasingly mastering token economics. Because basically, it’s one of these things that I think can bite you.

For example, the first time you use Claude Code, you go, “Oh, man, this tool is powerful.” And then boom, you get an email with a bill. I see, that’s why it’s powerful. And you multiply that across the board to teams who are starting to maybe deploy some of these things. And you see the importance of FinOps.

So where are we, Abi, as far as tooling for FinOps in the age of generative AI and also the practice of FinOps in the age of generative AI? 

22.19: Less than 5%, maybe even 2% of the way there. 

22.24: Really? But obviously everyone's aware of it, right? Because at some point, when you deploy, you become aware.

22.33: Not enough people. A lot of people just think about FinOps as cloud, basically the cloud cost. And there are different kinds of costs in the cloud. One of the things people are not doing enough of is profiling their models properly, which is [determining] "Where are the costs really coming from? Our models' compute power? Are they taking too much RAM?"

22.58: Or are we using reasoning when we don’t need it?

23.00: Exactly. Now that's a problem we solve very differently. That's where, yes, you can do kernel fusion. Define your own custom kernels. Right now there's a massive number of people who think we need to rewrite kernels for everything. It's only going to solve one problem, which is the compute-bound problem. But it's not going to solve the memory-bound problem. Your data engineering pipelines are what's going to solve your memory-bound problems.

And that's where most of the focus is missing. I've mentioned it in the book as well: Data engineering is the foundation for first being able to solve those problems. And then we move to the compute-bound problems. Do not start optimizing the kernels over there. And then the third part would be the communication-bound problem, which is "How do we make these GPUs talk smarter with each other? How do we figure out the agent consensus and all of those problems?"

Now that's a communication problem. And that's what happens when there are different levels of bandwidth. Everybody's dealing with the internet bandwidth as well, the serving speed as well, different kinds of costs, and every kind of transition from one node to another. If we're not really hosting our own infrastructure, then that's a different problem, because it depends on "Which server do you get assigned your GPUs on again?"

24.20: Yeah, yeah, yeah. I want to give a shout out to Ray—I’m an advisor to Anyscale—because Ray basically is built for these sorts of pipelines because it can do fine-grained utilization and help you decide between CPU and GPU. And just generally, you don’t think that the teams are taking token economics seriously?

I guess not. How many people have I heard talking about caching, for example? Because if it’s a prompt that [has been] answered before, why do you have to go through it again? 
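
A minimal sketch of the exact-match caching Ben describes: store responses keyed by a hash of the model and prompt so a repeated prompt never pays for a second generation. The generate_fn placeholder and the in-memory dict are assumptions; production setups typically add TTLs, shared storage, or semantic matching, and this is distinct from KV caching inside the model server.

```python
# Hypothetical exact-match response cache; generate_fn stands in for the real model call.
import hashlib

_cache: dict[str, str] = {}

def cached_generate(prompt: str, model: str, generate_fn) -> str:
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = generate_fn(prompt)  # only pay for tokens on a cache miss
    return _cache[key]

# The second call returns the stored answer without invoking the model again.
calls = []
def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return f"echo: {prompt}"

print(cached_generate("What is toil?", "demo-model", fake_model))
print(cached_generate("What is toil?", "demo-model", fake_model))
print("model calls:", len(calls))  # 1
```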

25.07: I think plenty of people have started implementing KV caching, but they don’t really know. . . Again, one of the questions people don’t understand is “How much do we need to store in the memory itself, and how much do we need to store in the cache?” which is the big memory question. So that’s the one I don’t think people are able to solve. A lot of people are storing too much stuff in the cache that should actually be stored in the RAM itself, in the memory.

And there are generalist applications that don’t really understand that this agent doesn’t really need access to the memory. There’s no point. It’s just lost in the throughput really. So I think the problem isn’t really caching. The problem is that differentiation of understanding for people. 

25.55: Yeah, yeah, I just threw that out as one element. Because obviously there are many, many things to mastering token economics. So you brought up reinforcement learning. A few years ago, obviously people got really into "Let's do fine-tuning." But then they quickly realized. . . And actually fine-tuning became easy because basically there became so many services where you can just focus on labeled data. You upload your labeled data, boom, come back from lunch, you have a fine-tuned model.

But then people realize that “I fine-tuned, but the model that results isn’t really as good as my fine-tuning data.” And then obviously RAG and context engineering came into the picture. Now it seems like more people are again talking about reinforcement learning, but in the context of LLMs. And there’s a lot of libraries, many of them built on Ray, for example. But it seems like what’s missing, Abi, is that fine-tuning got to the point where I can sit down a domain expert and say, “Produce labeled data.” And basically the domain expert is a first-class participant in fine-tuning.

As best I can tell, for reinforcement learning, the tools aren’t there yet. The UX hasn’t been figured out in order to bring in the domain experts as the first-class citizen in the reinforcement learning process—which they need to be because a lot of the stuff really resides in their brain. 

27.45: The big problem here, and very, very much to the point of what you pointed out, is the tools aren't really there. And one very specific thing I can tell you is most of the reinforcement learning environments that you're seeing are static environments. Agents are not learning statically. They are learning dynamically. If your RL environment cannot adapt dynamically. . . [This tooling] basically emerged in 2018, 2019, as OpenAI Gym and a lot of reinforcement learning libraries were coming out.

28.18: There is a line of work called curriculum learning, which is basically adapting the training difficulty to the model's results. So basically that can now be used in reinforcement learning, but I've not seen any practical implementation of using curriculum learning for reinforcement learning environments. So people create these environments—fantastic. They work well for a little bit of time, and then they become useless.

So that's where even OpenAI, Anthropic, those companies are struggling as well. They've paid heavily in contracts, yearlong contracts, to say, "Can you build this vertical environment? Can you build that vertical environment?" And that works fantastically. But once the model learns on it, then there's nothing else to learn. And then you go back to the question of "Is this data fresh? Is this adaptive with the world?" And it becomes the same RAG problem all over again.

29.18: So maybe the problem is with RL itself. Maybe we need a different paradigm. It's just too hard. 

Let me close by looking to the future. The first thing is—the space is moving so fast, this might be an impossible question to ask, but if you look at, let's say, 6 to 18 months, what are some things in the research domain that are not being talked about enough that might produce enough practical utility that we will start hearing about them in 6 to 12, 6 to 18 months?

29.55: One is how to profile your machine learning models, like the entire systems end-to-end. A lot of people do not understand them as systems, but only as models. So that’s one thing which will make a massive amount of difference. There are a lot of AI engineers today, but we don’t have enough system design engineers.

30.16: This is something that Ion Stoica at Sky Computing Lab has been giving keynotes about. Yeah. Interesting. 

30.23: The second part is. . . I'm optimistic about seeing curriculum learning applied to reinforcement learning as well, where our RL environments can adapt in real time, so when we train agents on them, they are dynamically adapting as well. That's also [some] of the work being done by labs like Circana, which are working on artificial life and all of that stuff—evolution of any kind of machine learning model.

30.57: The third thing where I feel like the communities are falling behind massively is on the data engineering side. That’s where we have massive gains to get. 

31.09: So on the data engineering side, I’m happy to say that I advise several companies in the space that are completely focused on tools for these new workloads and these new data types. 

Last question for our listeners: What mindset shift or what skill do they need to pick up in order to position themselves in their career for the next 18 to 24 months?

31.40: For anybody who's an AI engineer, a machine learning engineer, an LLMOps engineer, or an MLOps engineer, first learn how to profile your models. Start picking up Ray very quickly as a tool to just get started on, to see how distributed systems work. You can pick the LLM if you want, but start understanding distributed systems first. And once you start understanding those systems, then start looking back into the models themselves.

32.11: And with that, thank you, Aurimas.


Generative AI in the Real World: Chris Butler on GenAI in Product Management

30 October 2025 at 07:29

In this episode, Ben Lorica and Chris Butler, director of product operations for GitHub’s Synapse team, chat about the experimentation Chris is doing to incorporate generative AI into the product development process—particularly with the goal of reducing toil for cross-functional teams. It isn’t just automating busywork (although there’s some of that). He and his team have created agents that expose the right information at the right time, use feedback in meetings to develop “straw man” prototypes for the team to react to, and even offer critiques from specific perspectives (a CPO agent?). Very interesting stuff.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: Today we have Chris Butler of GitHub, where he leads a team called Synapse. Welcome to the podcast, Chris. 

00.15: Thank you. Yeah. Synapse is actually part of our product team and what we call EPD operations, which is engineering, product, and design. And our team is mostly engineers. I’m the product lead for it, but we help solve and reduce toil for these cross-functional teams inside of GitHub, mostly building internal tooling, with the focus on process automation and AI. But we also have a speculative part of our practice as well: trying to imagine the future of cross-functional teams working together and how they might do that with agents, for example.

00.45: Actually, you are the first person I’ve come across who’s used the word “toil.” Usually “tedium” is what people use, in terms of describing the parts of their job that they would rather automate. So you’re actually a big proponent of talking about agents that go beyond coding agents.

01.03: Yeah. That’s right. 

01.05: And specifically in your context for product people. 

01.09: And actually, for just the way that, say, product people work with their cross-functional teams. But I would also include other types of functions (legal, privacy, customer support, docs), any of these people that are working to actually help build a product; I think there needs to be a transformation of the way we think about these tools.

01.29: GitHub is a very engineering-led organization as well as a very engineering-focused organization. But my role is to really think about “How do we do a better job between all these people that I would call nontechnical—but they are sometimes technical, of course, but the people that are not necessarily there to write code. . . How do we actually work together to build great products?” And so that’s how I think about the work. 

01.48: For people who aren’t familiar with product management and product teams, what’s toil in the context of product teams? 

02.00: So toil is actually something that I stole from a Google SRE from the standpoint of any type of thing that someone has to do that is manual, tactical, repetitive. . . It usually doesn’t really add to the value of the product in any way. It’s something that as the team gets bigger or the product goes down the SDLC or lifecycle, it scales linearly, with the fact that you’re building bigger and bigger things. And so it’s usually something that we want to try to cut out, because not only is it potentially a waste of time, but there’s also a perception within the team it can cause burnout.

02.35: If I have to constantly be doing toilsome parts of my work, I feel I’m doing things that don’t really matter rather than focusing on the things that really matter. And what I would argue is especially for product managers and cross-functional teams, a lot of the time that is processes that they have to use, usually to share information within larger organizations.

02.54: A good example of that is status reporting. Status reporting is one of those things where people will spend anywhere from 30 minutes to hours per week. And sometimes it’s in certain parts of the team—technical product managers, product managers, engineering managers, program managers are all dealing with this aspect that they have to in some way summarize the work that the team is doing and then shar[e] that not only with their leadership. . . They want to build trust with their leadership, that they’re making the right decisions, that they’re making the right calls. They’re able to escalate when they need help. But also then to convey information to other teams that are dependent on them or they’re dependent on. Again, this is [in] very large organizations, [where] there’s a huge cost to communication flows.

03.35: And so that’s why I use status reporting as a good example of that. Now with the use of the things like LLMs, especially if we think about our LLMs as a compression engine or a translation engine, we can then start to use these tools inside of these processes around status reporting to make it less toilsome. But there’s still aspects of it that we want to keep that are really about humans understanding, making decisions, things like that. 
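As one concrete illustration of the “compression engine” idea, here is a minimal sketch that drafts a status report from raw issue updates, with a human still reviewing, editing, and owning the result. It assumes the OpenAI Python SDK and an API key; the model name and the sample updates are placeholders, not GitHub’s actual tooling.

```python
# A sketch of "LLM as compression engine" for status reporting: summarize
# raw updates into a draft that a human still reviews and owns.
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

# Placeholder updates; in practice these might come from issues, PRs, or chat.
issue_updates = [
    "auth-service: rate limiting shipped behind a flag, rollout next week",
    "billing: migration blocked on legal review of the data retention policy",
    "mobile: crash rate down 40% after fixing the startup race condition",
]

prompt = (
    "You draft weekly status reports for an engineering leader.\n"
    "Summarize the updates below into: (1) progress, (2) risks and blockers, "
    "(3) decisions needed from leadership. Be concise and do not invent details.\n\n"
    + "\n".join(f"- {u}" for u in issue_updates)
)

draft = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model you have access to
    messages=[{"role": "user", "content": prompt}],
)
# A person edits and sends this; the draft is the starting point, not the report.
print(draft.choices[0].message.content)
```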

03.59: And this is key. So one of the concerns that people have is about a hollowing out in the following context: If you eliminate toil in general, the problem there is that your most junior or entry-level employees actually learn about the culture of the organization by doing toil. There’s some level of toil that becomes part of the onboarding and the acculturation of young employees. But on the other hand, this is a challenge for organizations to just change how they onboard new employees and what kinds of tasks they give them and how they learn more about the culture of the organization.

04.51: I would differentiate between the idea of toil and paying your dues within the organization. In investment banking, there’s a whole concern about that: “They just need to sit in the office for 12 hours a day to really get the culture here.” And I would differentiate that from. . .

05.04: Or “Get this slide to pitch decks and make sure all the fonts are the right fonts.”

05.11: That’s right. Yeah, I worked at Facebook Reality Labs, and there were many times where we would do a Zuck review, and getting those slides perfect was a huge task for the team. What I would say is I want to differentiate this from the gaining of expertise. So if we think about Gary Klein, naturalistic decision making, real expertise is actually about being able to see an environment. And that could be a data environment [or] information environment as well. And then as you gain expertise, you’re able to discern between important signals and noise. And so what I’m not advocating for is to remove the ability to gain that expertise. But I am saying that toilsome work doesn’t necessarily contribute to expertise. 

05.49: In the case of status reporting as an example—status reporting is very valuable for a person to be able to understand what is going on with the team, and then, “What actions do I need to take?” And we don’t want to remove that. But the idea that a TPM or product manager or EM has to dig through all of the different issues that are inside of a particular repo to look for specific updates and then do their own synthesis of a draft, I think there is a difference there. And so what I would say is that the idea of me reading this information in a way that is very convenient for me to consume and then to be able to shape the signal that I then put out into the organization as a status report, that is still very much a human decision.

06.30: And I think that’s where we can start to use tools. Ethan Mollick has talked about this a lot in the way that he’s trying to approach including LLMs in, say, the classroom. There’s two patterns that I think could come out of this. One is that when I have some type of early draft of something, I should be able to get a lot of early feedback that is very low reputational risk. And what I mean by that is that a bot can tell me “Hey, this is not written in a way with the active voice” or “[This] is not really talking about the impact of this on the organization.” And so I can get that super early feedback in a way that is not going to hurt me.

If I publish a really bad status report, people may think less of me inside the organization. But using a bot or an agent or just a prompt to even just say, “Hey, these are the ways you can improve this”—that type of early feedback is really, really valuable. That I have a draft and I get critique from a bunch of different viewpoints I think is super valuable and will build expertise.

07.24: And then there’s the other side, which is, when we talk about consuming lots of information and then synthesizing or translating it into a draft, I can then critique “Is this actually valuable to the way that I think that this leader thinks? Or what I’m trying to convey as an impact?” And so then I am critiquing the straw man that is output by these prompts and agents.

07.46: Those two different patterns together actually create a really great loop for me to be able to learn not only from agents but also from the standpoint of seeing how. . . The part that ends up being really exciting is when once you start to connect the way communication happens inside the organization, I can then see what my leaders passed on to the next leader or what this person interpreted this as. And I can use that as a feedback loop to then improve, over time, my expertise in, say, writing a status report that is shaped for the leader. There’s also a whole thing that when we talk about status reporting in particular, there is a difference in expertise that people are getting that I’m not always 100%. . .

08.21: It’s valuable for me to understand how my leader thinks and makes decisions. I think that is very valuable. But the idea that I will spend hours and hours shaping and formulating a status report from my point of view for someone else can be aided by these types of systems. And so status should not be at the speaker’s mouth; it should be at the listener’s ear.

For these leaders, they want to be able to understand “Are the teams making the right decisions? Do I trust them? And then where should I preemptively intervene because of my experience or maybe my understanding of the context in the broader organization?” And so that’s what I would say: These tools are very valuable in helping build that expertise.

09.00: It’s just that we have to rethink “What is expertise?” And I just don’t buy it that paying your dues is the way you gain expertise. You do sometimes. Absolutely. But a lot of it is also just busy work and toil. 

09.11: My thing is these are productivity tools. And so you make even your junior employees productive—you just change the way you use your more-junior employees. 

09.24: Maybe just one thing to add to this is that there is something really interesting inside of the education world of using LLMs: trying to understand where someone is at. And so the type of feedback for someone that is very early in their career, or doing something for the first time, is potentially very different in the way that you’re teaching them or giving them feedback, versus someone that is much further along in expertise, who just wants to get down to “What are some things I’m missing here? Where am I biased?” Those are things where I think we also need to do a better job for those early employees, the people that are just starting to get expertise—“How do we train them using these tools as well as other ways?”

10.01: And I’ve done that as well. I do a lot of learning and development help, internal to companies, and I did that as part of the PM faculty for learning and development at Google. And so thinking a lot about how PMs gain expertise, I think we’re doing a real disservice by making product manager such a hard junior position to get.

10.18: I think it’s really bad because, right out of college, I started doing program management, and it taught me so much about this. But at Microsoft, when I joined, we would say that the program manager wasn’t really worth very much for the first two years, right? Because they’re gaining expertise in this.

And so I think LLMs can help give people the ability to gain expertise faster and also help them avoid making errors that other people might make. But I think there’s a lot to do with just learning and development in general that we need to pair with LLMs and human systems.

10.52: In terms of agents, I guess agents for product management, first of all, do they exist? And if they do, I always like to look at what level of autonomy they really have. Most agents really are still partially autonomous, right? There’s still a human in the loop. And so the question is “How much is the human in the loop?” It’s kind of like a self-driving car. There’s driver assists, and then there’s all the way to self-driving. A lot of the agents right now are “driver assist.” 

11.28: I think you’re right. That’s why I don’t always use the term “agent,” because it’s not an autonomous system that is storing memory using tools, constantly operating.

I would argue though that there is no such thing as “human out of the loop.” We’re probably just drawing the system diagram wrong if we’re saying that there’s no human that’s involved in some way. That’s the first thing. 

11.53: The second thing I’d say is that I think you’re right. A lot of the time right now, it ends up being that when the human needs the help, we end up creating systems inside of GitHub; we have something that’s called Copilot Spaces, which is really like a custom GPT. It’s really just a bundling of context that I can then go to when I need help with a particular type of thing. We built very highly specific types of Copilot Spaces, like “I need to write a blog announcement about something. And so what’s the GitHub writing style? How should I be wording this, avoiding jargon?” Internal things like that. So it can be highly specific. 

We also have more general tools that are kind of like “How do I form and maintain initiatives throughout the entire software development lifecycle? When do I need certain types of feedback? When do I need to generate the 12 to 14 different documents that compliance and downstream teams need?” And so those tend to be operating in the background to autodraft these things based on the context that’s available. So I’d say that’s semiagentic, to a certain extent. 

12.52: But I think actually there’s really big opportunities when it comes to. . . One of the cases that we’re working on right now is actually linking information in the GitHub graph that is not commonly linked. And so a key example of that might be kicking off all of the process that goes along with doing a release. 

When I first get started, I actually want to know in our customer feedback repo, in all the different places where we store customer feedback, “Where are there times that customers actually asked about this or complained about it or had some information about this?” And so when I get started, being able to automatically link something like a release tracking issue with all of this customer feedback becomes really valuable. But it’s very hard for me as an individual to do that. And what we really want—and what we’re building—[are] things that are more and more autonomous about constantly searching for feedback or information that we can then connect to this release tracking issue.

13.44: So that’s why I say we’re starting to get into the autonomous realm when it comes to this idea of something going around looking for linkages that don’t exist today. And so that’s one of those things, because again, we’re talking about information flow. And a lot of the time, especially in organizations the size of GitHub, there’s lots of siloing that takes place.

We have lots of repos. We have lots of information. And so it’s really hard for a single person to ever keep all of that in their head and to know where to go, and so [we’re] bringing all of that into the tools that they end up using. 
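As a rough sketch of that kind of linking (surfacing existing customer feedback to attach to a release-tracking issue), here is what it might look like against GitHub’s public issue-search API. The repo name and query are hypothetical, and GitHub’s internal tooling is certainly richer than this.

```python
# A rough sketch of linking a release to existing customer feedback via
# GitHub's public issue-search API. The repo name and query are hypothetical;
# GitHub's internal tooling is certainly richer than this.
import os

import requests

GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "")  # optional; raises rate limits

def find_related_feedback(feature_keywords: str, feedback_repo: str) -> list[dict]:
    """Search a feedback repo for issues mentioning the feature being released."""
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"{feature_keywords} repo:{feedback_repo} is:issue"},
        headers={"Authorization": f"Bearer {GITHUB_TOKEN}"} if GITHUB_TOKEN else {},
        timeout=30,
    )
    resp.raise_for_status()
    return [{"title": item["title"], "url": item["html_url"]}
            for item in resp.json().get("items", [])]

if __name__ == "__main__":
    # Hypothetical: gather feedback to attach to a release-tracking issue.
    for hit in find_related_feedback("dark mode", "acme/customer-feedback"):
        print(f"- {hit['title']}: {hit['url']}")
```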

14.14: So for example, we’ve also created internal things—these are more assist-type use cases—but the idea of a Gemini Gem inside of a Google doc or an M365 agent inside of Word that is then also connected to the GitHub graph in some way. I think it’s “When do we expose this information? Is it always happening in the background, or is it only when I’m drafting the next version of this initiative that ends up becoming really, really important?”

14.41: Some of the work we’ve been experimenting with is actually “How do we start to include agents inside of the synchronous meetings that we actually do?” You probably don’t want an agent to suddenly start speaking, especially because there’s lots of different agents that you may want to have in a meeting.

We don’t have a designer on our team, so I actually end up using an agent that is prompted to be like a designer and think like a designer inside of these meetings. And so we probably don’t want them to speak up dynamically inside the meeting, but we do want them to add information if it’s helpful. 

We want to autoprototype things as a straw man for us to be able to react to. We want to start to use our planning agents and stuff like that to help us plan out “What is the work that might need to take place?” It’s a lot of experimentation about “How do we actually pull things into the places that humans are doing the work?”—which is usually synchronous meetings, some types of asynchronous communication like Teams or Slack, things like that.

15.32: So that’s where I’d say the full possibility [is] for, say, a PM. And our customers are also TPMs and leaders and people like that. It really has to do with “How are we linking synchronous and asynchronous conversations with all of this information that is out there in the ecosystem of our organization that we don’t know about yet, or viewpoints that we don’t have that we need to have in this conversation?”

15.55: You mentioned the notion of a design agent passively attending a meeting in the background. This is fascinating. So this design agent, what is it? Is it a fine-tuned agent or. . .? What exactly makes it a design agent? 

16.13: In this particular case, it’s a specific prompt that defines what a designer would usually do in a cross-functional team and what they might ask questions about, what they would want clarification of. . .

16.26: Completely reliant on the pretrained foundation model—no posttraining, no RAG, nothing? 

16.32: No, no. [Everything is in the prompt] at this point. 

16.36: How big is this prompt? 

16.37: It’s not that big. I’d say it’s maybe at most 50 lines, something like that. It’s pretty small. The truth is, the idea of a designer is something that LLMs know about. But more for our specific case, right now it’s really just based on this live conversation. And there are a lot of papercuts in the way that we have to do a site call, pull a live transcript, put it into a space, and [then] have a bunch of different agents inside the space that pipe up when they have something interesting to say, essentially.

And it’s a little weird because I have to share my screen and people have to read it while we hold the meeting. So it’s clunky right now in the way that we bring this in. But what it will bring up is “Hey, these are patterns inside of design that you may want to think about.” Or, you know, “For this particular part of the experience, it’s still pretty ambiguous. Do you want to define more about what this part of the process is?” And we’ve also included legal, privacy, data-oriented groups. Even the idea of a facilitator agent saying that we’re getting off track or we have these other things to discuss, that type of stuff. So again, these are really rudimentary right now.
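A minimal sketch of that pattern might look like the following: a persona prompt plus a rolling window of the live transcript, with the model told to stay quiet unless it has something worth raising. It assumes the OpenAI Python SDK; the persona text and model name are illustrative, not GitHub’s actual setup.

```python
# A sketch of a persona "designer agent" that watches a rolling meeting
# transcript and only speaks when it has something useful. Assumes the
# OpenAI Python SDK; the persona text and model name are illustrative.
from openai import OpenAI

client = OpenAI()

DESIGNER_PERSONA = (
    "You are acting as the product designer in a cross-functional meeting. "
    "Watch the transcript and speak up only about interaction patterns, "
    "ambiguous user flows, or accessibility concerns. "
    "If you have nothing useful to add, reply with exactly PASS."
)

def designer_comment(transcript_window: str) -> str | None:
    """Return a design comment on the recent transcript, or None to stay quiet."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": DESIGNER_PERSONA},
            {"role": "user", "content": transcript_window},
        ],
    )
    text = resp.choices[0].message.content.strip()
    return None if text == "PASS" else text

# In practice you would feed rolling transcript chunks from the meeting tool.
chunk = "PM: users land on settings first. Eng: we could hide it behind a flag."
print(designer_comment(chunk) or "(designer agent stayed quiet)")
```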

17.37: Now, what I could imagine though is, we have a design system inside of GitHub. How might we start to use that design system and use internal prototyping tools to autogenerate possibilities for what we’re talking about? And I guess when I think about using prototyping as a PM, I don’t think the PMs should be vibe coding everything.

I don’t think the prototype replaces a lot of the cross-functional documents that we have today. But I think what it does increase is that if we have been talking about a feature for about 30 minutes, that is a lot of interesting context that if we can say, “Autogenerate three different prototypes that are coming from slightly different directions, slightly different places that we might integrate inside of our current product,” I think what it does is it gives us, again, that straw man for us to be able to critique, which will then uncover additional assumptions, additional values, additional principles that we maybe haven’t written down somewhere else.

18.32: And so I see that as super valuable. And that’s the thing that we end up doing—we’ll use an internal product for prototyping to just take that and then have it autogenerated. It takes a little while right now, you know, a couple minutes to do a prototype generation. And so in those cases we’ll just [say], “Here’s what we thought about so far. Just give us a prototype.” And again it doesn’t always do the right thing, but at least it gives us something to now talk about because it’s more real now. It is not the thing that we end up implementing, but it is the thing that we end up talking about. 

18.59: By the way, this notion of an agent attending some synchronous meeting, you can imagine taking it to the next level, which is to take advantage of multimodal models. The agent can then absorb speech and maybe visual cues, so then basically when the agent suggests something and someone reacts with a frown. . . 

19.25: I think there’s something really interesting about that. And when you talk about multimodal, I do think that one of the things that is really important about human communication is the way that we pick up cues from each other—if we think about it, the reason why we actually talk to each other. . . And there’s a great book called The Enigma of Reason that’s all about this.

But their hypothesis is that, yes, we can try to logic or pretend to logic inside of our own heads, but we actually do a lot of post hoc analysis. So we come up with an idea inside our head. We have some certainty around it, some intuition, and then we fit it to why we thought about this. So that’s what we do internally. 

But when you and I are talking, I’m actually trying to read your mind in some way. I’m trying to understand the norms that are at play. And I’m using your facial expression. I’m using your tone of voice. I’m using what you’re saying—actually way less of what you’re saying and more your facial expression and your tone of voice—to determine what’s going on.

20.16: And so I think this idea of engagement with these tools and the way these tools work, I think [of] the idea of gaze tracking: What are people looking at? What are people talking about? How are people reacting to this? And then I think this is where in the future, in some of the early prototypes we built internally for what the synchronous meeting would look like, we have it where the agent is raising its hand and saying, “Here’s an issue that we may want to discuss.” If the people want to discuss it, they can discuss it, or they can ignore it. 

20.41: Longer term, we have to start to think about how agents are fitting into the turn-taking of conversation with the rest of the group. And using all of these multimodal cues ends up being very interesting, because you wouldn’t want just an agent whenever it thinks of something to just blurt it out.

20.59: And so there’s a lot of work to do here, but I think there’s something really exciting about just using engagement as the means to understand what the hot topics are, but also trying to help detect “Are we rat-holing on something that should be put in the parking lot?” Those are things and cues that we can start to get from these systems as well.

21.16: By the way, context has multiple dimensions. So you can imagine in a meeting between the two of us, you outrank me. You’re my manager. But then it turns out the agent realizes, “Well, actually, looking through the data in the company, Ben knows more about this topic than Chris. So maybe when I start absorbing their input, I should weigh Ben’s more heavily, even though in the org chart Chris outranks Ben.” 

21.46: A related story is one of the things I’ve created inside of a Copilot Space is actually a proxy for our CPO. And so what I’ve done is I’ve taken meetings that he’s done where he asked questions in a smaller setting, taken his writing samples and things like that, and I’ve tried to turn it into, not really an agent, but a space where I can say, “Here’s what I’m thinking about for this plan. And what would Mario [Rodriguez] potentially think about this?” 

It’s definitely not 100% accurate in any way. Mario’s an individual that is constantly changing and is learning and has intuitions that he doesn’t say out loud, but it is interesting how it does sound like him. It does seem to focus on questions that he has brought up in previous meetings, based on the context that we provided. And so I think, to your point, there are a lot of things said inside of meetings right now that we then don’t use to actually help understand people’s points of view in a deeper way.

22.40: You could imagine that this proxy also could be used for [determining] potential blind spots for Mario that, as a person that is working on this, I may need to deal with, in the sense that maybe he’s not always focused on this type of issue, but I think it’s a really big deal. So how do I help him actually understand what’s going on?

22.57: And this gets back to that reporting: Is that the listener’s ear? What does that person actually care about? What do they need to know about to build trust with the team? What do they need to take action on? Those are things that I think we can start to build interesting profiles. 

There’s a really interesting ethical question, which is: Should that person be able to write their own proxy? Would it include the blind spots that they have or not? And then maybe compare this to—you know, there’s [been] a trend for a little while where every leader would write their own user manual or readme, and inside of those things, they tend to be a bit more performative. It’s more about how they idealize their behavior versus the way that they actually are.

23.37: And so there’s some interesting problems that start to come up when we’re doing proxying. I don’t call it a digital twin of a person, because digital twins to me are basically simulations of mechanical things. But to me it’s “What is this proxy that might sit in this meeting to help give us a perspective and maybe even identify when this is something we should escalate to that person?”

23.55: I think there’s lots of very interesting things. Power structures inside of the organization are really hard to discern because there’s both, to your point, hierarchical ones that are very set in the systems that are there, but there’s also unsaid ones. 

I mean, one funny story is Ray Dalio did try to implement this inside of his hedge fund. And unfortunately, I guess, for him, there were two people that were considered to be higher ranking in reputation than him. But then he changed the system so that he was ranked number one. So I guess we have to worry about this type of thing for these proxies as well. 

24.27: One of the reasons why coding is such a great playground for these things is, one, you can validate the result. But secondly, the data is quite tame and relatively well structured. So you have version control systems like GitHub—you can look through that and say, “Hey, actually Ben’s commits are much more valuable than Chris’s commits.” Or “Ben is the one who suggested all of these changes before, and they were all accepted. So maybe we should really take Ben’s opinion much more strong[ly].” I don’t know what artifacts you have in the product management space that can help develop this reputation score.

25.09: Yeah. It’s tough because a reputation score, especially once you start to monitor some type of metric and it becomes the goal, that’s where we get into problems. For example, Agile teams adopting velocity as a metric: It’s meant to be an internal metric that helps us understand “If this person is out, how does that adjust what type of work we need to do?” But then comparing velocities between different teams ends up creating a whole can of worms around “Is this actually the metric that we’re trying to optimize for?”

25.37: And even when it comes to product management, what I would say is actually valuable a lot of the time is “Does the team understand why they’re working on something? How does it link to the broader strategy? How does this solve both business and customer needs? And then how are we wrangling this uncertainty of the world?” 

I would argue that a really key meta skill for product managers—and for other people like generative user researchers, business development people, even leaders inside the organization—is that they have to deal with a lot of uncertainty. And it’s not that we need to shut down the uncertainty, because uncertainty is actually something we should take advantage of and use in some way. But there are places where we need to be able to build enough certainty for the team to do their work and then make plans that are resilient to future uncertainty. 

26.24: And then finally, the ability to communicate what the team is doing and why it’s important is very valuable. Unfortunately, there’s not a lot of. . . Maybe there’s rubrics we can build. And that’s actually what career ladders try to do for product managers. But they tend to be very vague actually. And as you get more senior inside of a product manager organization, you start to see things—it’s really just broader views, more complexity. That’s really what we start to judge product managers on. Because of that fact, it’s really about “How are you working across the team?”

26.55: There will be cases, though, that we can start to say, “Is this thing thought out well enough at first, at least for the team to be able to take action?” And then linking that work as a team to outcomes ends up being something that we can apply more and more data rigor to. But I worry about it being “This initiative brief was perfect, and so that meant the success of the product,” when the reality was that was maybe the starting point, but there was all this other stuff that the product manager and the team was doing together. So I’m always wary of that. And that’s where performance management for PMs is actually pretty hard: where you have to base most of your understanding on how they work with the other teammates inside their team.

27.35: You’ve been in product for a long time, so you have a network of peers in other companies, right? What are one or two examples of the use of AI—not at GitHub—in the product management context that you admire? 

27.53: For a lot of the people that I know that are inside of startups that are basically using prototyping tools to build out their initial product, I have a lot of, not necessarily envy, but I respect that a lot because you have to be so scrappy inside of a startup, and you’re really there to not only prove something to a customer, or actually not even prove something, but get validation from customers that you’re building the right thing. And so I think that type of rapid prototyping is something that is super valuable for that stage of an organization.

28.26: When I start to look at larger enterprises, what I do see is that these prototyping tools don’t help as much with what we’ll call brownfield development: We need to build something on top of this other thing. It’s actually hard to use these tools today to imagine new things inside of a current ecosystem or a current design system.

28.46: [For] a lot of the teams that are in other places, it really is a struggle to get access to some of these tools. The thing that’s holding back the biggest enterprises from actually doing interesting work in this area is they’re overconstraining what their engineers [and] product managers can use as far as these tools.

And so what’s actually being created is shadow systems, where the person is using their personal ChatGPT to actually do the work rather than something that’s within the compliance of the organization.

29.18: Which is great for IP protection. 

29.19: Exactly! That’s the problem, right? Some of this stuff, you do want to use the most current tools. Because there is actually not just [the] time savings aspect and toil reduction aspects—there’s also just the fact that it helps you think differently, especially if you’re an expert in your domain. It really aids you in becoming even better at what you’re doing. And then it also shores up some of your weaknesses. Those are the things that really expert people are using these types of tools for. But in the end, it comes down to a combination of legal, HR, and IT, and budgetary types of things too, that are holding back some of these organizations.

30.00: When I’m talking to other people inside of the orgs. . . Maybe another problem for enterprises right now is that a lot of these tools require lots of different context. We’ve benefited inside of GitHub in that a lot of our context is inside the GitHub graph, so Copilot can access it and use it. But other teams keep things in all of these individual vendor platforms.

And so the biggest problem then ends up being “How do we merge these different pieces of context in a way that is allowed?” When I first started working on the Synapse team, I looked at the patterns we were building, and it was like, “If we just had access to Zapier or Relay or something like that, that is exactly what we need right now.” Except we would not have any of the approvals for the connectors to all of these different systems. And so Airtable is a great example of something like that too: They’re building out process automation platforms that focus on data as well as connecting to other data sources, plus the idea of including LLMs as components inside these processes.

30.58: A really big issue I see for enterprises in general is the connectivity issue between all the datasets. And there are, of course, teams that are working on this—Glean or others that are trying to be more of an overall data copilot frontend for your entire enterprise datasets. But I just haven’t seen as much success in getting all these connected. 

31.17: I think one of the things that people don’t realize is enterprise search is not turnkey. You have to get in there and really do all these integrations. There are no shortcuts. There’s no such thing as a vendor coming to you and saying, “Yeah, just use our system and it all magically works.”

31.37: This is why we need to hire more people with degrees in library science, because they actually know how to manage these types of systems. Again, I first cut my teeth on this in very early versions of SharePoint a long time ago. And even inside there, there’s so much that you need to do to just help people with not only organization of the data but even just the search itself.

It’s not just a search index problem. It’s a bunch of different things. And that’s why, whenever we’re shown an empty text box, there’s so much work that goes on just behind it; inside of Google, with all of the instant answers, there are lots of different ways that a particular search query is actually looked at, not just going against the search index but also just providing you the right information. And now they’re trying to include Gemini by default in there. The same thing happens within any copilot. There are a million different things you could use. 

32.27: And so I guess maybe this gets to my hypothesis about the way that agents will be valuable, either fully autonomous ones or ones that are attached to a particular process: having many different agents that are highly biased in a particular way. And I use the term “bias” as in bias can be good, neutral, or bad, right? I don’t mean bias in the sense of unfairness and that type of stuff; I mean more from the standpoint of “This agent is meant to represent this viewpoint, and it’s going to give you feedback from this viewpoint.” That ends up becoming really, really valuable because of the fact that you will not always be thinking about everything. 

33.00: I’ve done a lot of work in adversarial thinking and red teaming and stuff like that. One of the things that is most valuable is to build prompts that are breaking the sycophancy of these different models that are there by default, because it should be about challenging my thinking rather than just agreeing with it.

And then the standpoint of each one of these highly biased agents actually helps provide a very interesting approach. I mean, if we go to things like meeting facilitation or workshop facilitation groups, this is why. . . I don’t know if you’re familiar with the six hats, but the six hats is a technique by which we declare inside of a meeting that I’m going to be the one that’s all positivity. This person’s going to be the one about data. This person’s gonna be the one that’s the adversarial, negative one, etc., etc. When you have all of these different viewpoints, because of the tensions in the discussion of those ideas, the creation of options, the weighing of options, I think you end up making much better decisions. That’s where I think those highly biased viewpoints end up becoming really valuable. 
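Here is a hedged sketch of that “highly biased viewpoints” pattern: fan a draft out to several persona prompts, each with an explicit instruction not to be agreeable. It assumes the OpenAI Python SDK; the personas, the model name, and the sample draft are placeholders.

```python
# A sketch of the "highly biased viewpoints" pattern: fan a draft out to
# several persona prompts, each told not to be agreeable. Assumes the
# OpenAI Python SDK; personas, model name, and the draft are placeholders.
from openai import OpenAI

client = OpenAI()

PERSONAS = {
    "optimist": "Argue for the upside and the fastest path to shipping.",
    "skeptic": "Find the weakest assumptions and the ways this fails.",
    "data": "Ask what evidence exists and what metric would prove this wrong.",
    "legal_privacy": "Flag compliance, privacy, and data-handling risks.",
}

ANTI_SYCOPHANCY = (
    "Do not flatter the author or agree by default. "
    "Challenge the thinking from your assigned viewpoint."
)

def critique(draft: str) -> dict[str, str]:
    """Collect one critique per persona for a draft document."""
    feedback = {}
    for name, stance in PERSONAS.items():
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": f"{stance} {ANTI_SYCOPHANCY}"},
                {"role": "user", "content": draft},
            ],
        )
        feedback[name] = resp.choices[0].message.content
    return feedback

for persona, notes in critique("Proposal: auto-close stale issues after 30 days.").items():
    print(f"--- {persona} ---\n{notes}\n")
```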

34.00: For product people who are early in their career or want to enter the field, what are some resources that they should be looking at in terms of leveling up on the use of AI in this context?

34.17: The first thing is there are millions of prompt libraries out there for product managers. What you should do is when you are creating work, you should be using a lot of these prompts to give you feedback, and you can actually even write your own, if you want to. But I would say there’s lots of material out there for “I need to write this thing.”

What is a way to [do something like] “I try to write it and then I get critique”? But then how might this AI system, through a prompt, generate a draft of this thing? And then I go in and look at it and say, “Which things are not actually quite right here?” And I think that again, those two patterns of getting critique and giving critique end up building a lot of expertise.

34.55: I think also within the organization itself, I believe an awful lot in things that are called basically “learning from your peers.” Being able to join small groups where you are getting feedback from your peers and including AI agent feedback inside of the small peer groups is very valuable. 

There’s another technique, which is using case studies. As part of my learning and development practice, I actually do something called “decision-forcing cases,” where we take a story that actually happened, walk people through it, and ask them what they think is happening and what they would do next. When you do those types of things across junior and senior people, you can start to actually learn the expertise from the senior people through these types of case studies.

35.37: I think there’s an awful lot more that senior leaders inside the organization should be doing. And as junior people inside your organization, you should be going to these senior leaders and saying, “How do you think about this? What is the way that you make these decisions?” Because what you’re actually pulling from is their past experience and expertise that they’ve gained to build that intuition.

35.53: There’s all sorts of surveys of programmers and engineers and AI. Are there surveys about product managers? Are they freaked out or what? What’s the state of adoption and this kind of thing? 

36.00: Almost every PM that I’ve met has used an LLM in some way, to help them with their writing in particular. And if you look at the studies by OpenAI about the use of ChatGPT, a lot of the writing tasks end up coming from a product manager or senior leader standpoint. I think people are freaked out because every practice says that some other practice is going to be replaced, because now you can, in some way, stand in for that viewpoint with an LLM.

36.38: I don’t think product management will go away. We may change the terminology that we end up using. But this idea of someone that is helping manage the complexity of the team, help with communication, help with [the] decision-making process inside that team is still very valuable and will be valuable even when we can start to autodraft a PRD.

I would argue that the draft of the PRD is not what matters. It’s actually the discussions that take place in the team after the PRD is created. And I don’t think that designers are going to take over the PM work because, yes, it is about to a certain extent the interaction patterns and the usability of things and the design and the feeling of things. But there’s all these other things that you need to worry about when it comes to matching it to business models, matching it to customer mindsets, deciding which problems to solve. They’re doing that. 

37.27: There’s a lot of this concern about [how] every practice is saying this other practice is going to go away because of AI. I just don’t think that’s true. I just think we’re all going to be given different levels of abstraction to gain expertise on. But the core of what we do—an engineer focusing on what is maintainable and buildable and actually something that we want to work on versus the designer that’s building something usable and something that people will feel good using, and a product manager making sure that we’re actually building the thing that is best for the company and the user—those are things that will continue to exist even with these AI tools, prototyping tools, etc.

38.01: And for our listeners, as Chris mentioned, there are many, many prompt templates for product managers. We’ll try to get Chris to recommend one, and we’ll put it in the episode notes. [See “Resources from Chris” below.] And with that, thank you, Chris. 

38.18: Thank you very much. Great to be here.

Resources from Chris

Here’s what Chris shared with us following the recording:

There are two [prompt resources for product managers] that I think people should check out:

However, I’d say that people should take these as a starting point and they should adapt them for their own needs. There is always going to be nuance for their roles, so they should look at how people do the prompting and modify for their own use. I tend to look at other people’s prompts and then write my own.

If they are thinking about using prompts frequently, I’d make a plug for Copilot Spaces to pull that context together.


Generative AI in the Real World: Context Engineering with Drew Breunig

16 October 2025 at 07:18

In this episode, Ben Lorica and Drew Breunig, a strategist at the Overture Maps Foundation, talk all things context engineering: what’s working, where things are breaking down, and what comes next. Listen in to hear why huge context windows aren’t solving the problems we hoped they might, why companies shouldn’t discount evals and testing, and why we’re doing the field a disservice by leaning into marketing and buzzwords rather than trying to leverage what the current crop of LLMs is actually capable of.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: All right. So today we have Drew Breunig. He is a strategist at the Overture Maps Foundation. And he’s also in the process of writing a book for O’Reilly called the Context Engineering Handbook. And with that, Drew, welcome to the podcast.

00.23: Thanks, Ben. Thanks for having me on here. 

00.26: So context engineering. . . I remember before ChatGPT was even released, someone was talking to me about prompt engineering. I said, “What’s that?” And then of course, fast-forward to today, now people are talking about context engineering. And I guess the short definition is it’s the delicate art and science of filling the context window with just the right information. What’s broken with how teams think about context today? 

00.56: I think it’s important to talk about why we need a new word or why a new word makes sense. I was just talking with Mike Taylor, who wrote the prompt engineering book for O’Reilly, exactly about this and why we need a new word. Why is prompt engineering not good enough? And I think it has to do with the way the models, and the way they’re being built, are evolving. I think it also has to do with the way that we’re learning how to use these models. 

And so prompt engineering was a natural word to think about when your interaction and how you program the model was maybe one turn of conversation, maybe two, and you might pull in some context to give it examples. You might do some RAG and context augmentation, but you’re working with this one-shot service. And that was really similar to the way people were working in chatbots. And so prompt engineering started to evolve as this thing. 

02.00: But as we started to build agents and as companies started to develop models that were capable of multiturn tool-augmented reasoning usage, suddenly you’re not using that one prompt. You have a context that is sometimes being prompted by you, sometimes being modified by your software harness around the model, sometimes being modified by the model itself. And increasingly the model is starting to manage that context. And that prompt is very user-centric. It is a user giving that prompt. 

But when we start to have these multiturn systematic editing and preparation of contexts, a new word was needed, which is this idea of context engineering. This is not to belittle prompt engineering. I think it’s an evolution. And it shows how we’re evolving and finding this space in real time. I think context engineering is more suited to agents and applied AI programing, whereas prompt engineering lives in how people use chatbots, which is a different field. It’s not better and not worse. 

And so context engineering is more specific to understanding the failure modes that occur, diagnosing those failure modes and establishing good practices for both preparing your context but also setting up systems that fix and edit your context, if that makes sense. 

03.33: Yeah, and also, it seems like the words themselves are indicative of the scope, right? So “prompt” engineering means it’s the prompt. So you’re fiddling with the prompt. And [with] context engineering, “context” can be a lot of things. It could be the information you retrieve. It might involve RAG, so you retrieve information. You put that in the context window. 

04.02: Yeah. And people were doing that with prompts too. But I think in the beginning we just didn’t have the words. And that word became a big empty bucket that we filled up. You know, the quote I always quote too often, but I find it fitting, is one of my favorite quotes from Stuart Brand, which is, “If you want to know where the future is being made, follow where the lawyers are congregating and the language is being invented,” and the arrival of context engineering as a word came after the field was invented. It just kind of crystallized and demarcated what people were already doing. 

04.36: So the word “context” means you’re providing context. So context could be a tool, right? It could be memory. Whereas the word “prompt” is much more specific. 

04.55: And I think it also is like, it has to be edited by a person. I’m a big advocate for not using anthropomorphizing words around large language models. “Prompt” to me involves agency. And so I think it’s nice—it’s a good delineation. 

05.14: And then I think one of the very immediate lessons that people realize is, just because. . . 

So one of the things that these model providers note when they have a model release is “What’s the size of the context window?” So people started associating the context window with “I stuff as much as I can in there.” But the reality is that, one, it’s not efficient. And two, it’s also not useful to the model. Just because you have a massive context window doesn’t mean that the model treats the entire context window evenly.

05.57: Yeah, it doesn’t treat it evenly. And it’s not a one-size-fits-all solution. So I don’t know if you remember last year, but that was the big dream, which was, “Hey, we’re doing all this work with RAG and augmenting our context. But wait a second, if we can make the context 1 million tokens, 2 million tokens, I don’t have to run RAG on all of my corporate documents. I can just fit it all in there, and I can constantly be asking this. And if we can do this, we essentially have solved all of the hard problems that we were worrying about last year.” And so that was the big hope. 

And you started to see an arms race of everybody trying to make bigger and bigger context windows to the point where, you know, Llama 4 had its spectacular flameout. It was rushed out the door. But the headline feature by far was “We will be releasing a 10 million token context window.” And the thing that everybody realized is. . .  Like, all right, we were really hopeful for that. And then as we started building with these context windows, we started to realize there were some big limitations around them.

07.01: Perhaps the thing that clicked for me was in Google’s Gemini 2.5 paper. Fantastic paper. And one of the reasons I love it is because they dedicate about four pages in the appendix to talking about the kind of methodology and harnesses they built so that they could teach Gemini to play Pokémon: how to connect it to the game, how to actually read out the state of the game, how to make choices about it, what tools they gave it, all of these other things.

And buried in there was a real “warts and all” case study; those are my favorites, when you talk about the hard things and especially when you cite the things you can’t overcome. And Gemini 2.5 had a million-token context window, with 2 million tokens eventually coming. But in this Pokémon thing, they said, “Hey, we actually noticed something, which is once you get to about 200,000 tokens, things start to fall apart, and they fall apart for a host of reasons. They start to hallucinate.” One of the things that is really demonstrable is they start to rely more on the context knowledge than the weights knowledge. 

08.22: So inside every model there’s a knowledge base. There’s, you know, all of these other things that get kind of buried into the parameters. But when you reach a certain level of context, it starts to overload the model, and it starts to rely more on the examples in the context. And so this means that you are not taking advantage of the full strength or knowledge of the model. 

08.43: So that’s one way it can fail. We call this “context distraction,” though Kelly Hong at Chroma has written an incredible paper documenting this, which she calls “context rot,” which is a similar way [of] charting when these benchmarks start to fall apart.

Now the cool thing about this is that you can actually use this to your advantage. There’s another paper out of, I believe, the Harvard Interaction Lab, where they look at these inflection points for. . . 

09.13: Are you familiar with the term “in-context learning”? In-context learning is when you teach the model to do something it doesn’t know how to do by providing examples in your context. And those examples illustrate how it should perform. It’s not something that it’s seen before. It’s not in the weights. It’s a completely unique problem. 

Well, sometimes those in-context learning[s] are counter to what the model has learned in the weights. So they end up fighting each other, the weights and the context. And this paper documented that when you get over a certain context length, you can overwhelm the weights and you can force it to listen to your in-context examples.
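A tiny illustration of in-context learning: the examples in the prompt define a made-up shorthand that no model has seen in training, and the model is expected to infer the pattern and apply it to the new request. This is just the prompt construction; send it as the user message to whichever model you’re testing.

```python
# In-context learning, in miniature: the examples define a made-up shorthand
# that is not in any model's weights, and the model must infer the pattern.
examples = [
    ("ship the fix on friday", "SHIP-FIX-FRI"),
    ("review the design doc tomorrow", "REV-DESIGN-TMRW"),
    ("cancel the standup next week", "CANC-STANDUP-NXTWK"),
]
new_request = "schedule the retro for monday"

prompt = "Convert each request into our internal shorthand code.\n\n"
for text, code in examples:
    prompt += f"Request: {text}\nCode: {code}\n\n"
prompt += f"Request: {new_request}\nCode:"

print(prompt)  # send this as the user message to whichever model you're testing
```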

09.57: And so all of this is just to try to illustrate the complexity of what’s going on here and how I think one of the traps that leads us to this place is that the gift and the curse of LLMs is that we prompt and build contexts that are in the English language or whatever language you speak. And so that leads us to believe that they’re going to react like other people or entities that read the English language.

And the fact of the matter is, they don’t—they’re reading it in a very specific way. And that specific way can vary from model to model. And so you have to systematically approach this to understand these nuances, which is where the context management field comes in. 

10.35: This is interesting because even before those papers came out, there were studies which showed the exact opposite problem, which is the following: You may have a RAG system that actually retrieves the right information, but then somehow the LLMs can still fail because, as you alluded to, they have weights, so they have prior beliefs. They saw something on the internet, and they will opine against the precise information you retrieved from the context. 

11.08: This is a really big problem. 

11.09: So this is true even if the context window’s small actually. 

11.13: Yeah, and Ben, you touched on something that’s really important. So in my original blog post, I document four ways that context fails. I talk about “context poisoning.” That’s when the model hallucinates something in a long-running task and it stays in there, so it’s continually confusing it. “Context distraction,” which is when you overwhelm that soft limit to the context window and then you start to perform poorly. “Context confusion”: This is when you put things that aren’t relevant to the task inside your context, and suddenly the model thinks that it has to pay attention to this stuff and it leads it astray. And then the last thing is “context clash,” which is when there’s information in the context that’s at odds with the task that you are trying to perform. 

A good example of this is, say you’re asking the model to only reply in JSON, but you’re using MCP tools that are defined with XML. And so you’re creating this backwards thing. But I think there’s a fifth piece that I need to write about because it keeps coming up. And it’s exactly what you described.

12.23: Douwe [Kiela] over at Contextual AI refers to this as “context” or “prompt adherence.” But the term that keeps sticking in my mind is this idea of fighting the weights. There’s three situations you get yourself into when you’re interacting with an LLM. The first is when you’re working with the weights. You’re asking it a question that it knows how to answer. It’s seen many examples of that answer. It has it in its knowledge base. It comes back with the weights, and it can give you a phenomenal, detailed answer to that question. That’s what I call “working with the weights.” 

The second is what we referred to earlier, which is that in-context learning, which is you’re doing something that it doesn’t know about and you’re showing an example, and then it does it. And this is great. It’s wonderful. We do it all the time. 

But then there’s a third example which is, you’re providing it examples. But those examples are at odds with some things that it had learned usually during posttraining, during the fine-tuning or RL stage. A really good example is format outputs. 

13.34: Recently a friend of mine was updating his pipeline to try out Moonshot’s new model. A really great model, and a really great model for tool use. And so he just changed the model and hit run to see what happened. And it kept failing—his thing couldn’t even work. He’s like, “I don’t understand. This is supposed to be the best tool use model there is.” And he asked me to look at his code.

I looked at his code, and he was extracting data using Markdown, essentially: “Put the final answer in an ASCII box and I’ll extract it that way.” And I said, “If you change this to XML, see what happens. Ask it to respond in XML, use XML as your formatting, and see what happens.” He did that, and that one change passed every test—basically crushed it—because it was working with the weights. He wasn’t fighting the weights. Everyone who builds with AI has experienced this: the stubborn things the model refuses to do, no matter how many times you ask it, including formatting. 
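To make that concrete, here’s a minimal sketch of the kind of change described here—asking the model to wrap its final answer in an XML tag and extracting it with a regex. The tag name and sample response are invented for illustration; the point is simply that XML-style delimiters tend to work with, rather than against, how many models were posttrained.

```python
import re

# Instead of asking for an ASCII box and parsing it with Markdown tricks,
# ask the model to wrap its final answer in an XML tag and pull that out.
PROMPT_SUFFIX = "Wrap your final answer in <final_answer></final_answer> tags."

def extract_final_answer(model_response: str) -> str | None:
    """Return the contents of the <final_answer> tag, or None if it's missing."""
    match = re.search(r"<final_answer>(.*?)</final_answer>", model_response, re.DOTALL)
    return match.group(1).strip() if match else None

# Example: a typical response once the prompt asks for XML formatting
response = "Sure, here you go.\n<final_answer>42</final_answer>"
print(extract_final_answer(response))  # -> "42"
```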

14.35: [Here’s] my favorite example of this though, Ben: In ChatGPT’s web interface or their application interface, if you go there and try to prompt an image, a lot of the images that people prompt—and I’ve talked to user researchers about this—come from really boring prompts. They have a text box that can be anything, and they’ll say something like “a black cat” or “a statue of a man thinking.”

OpenAI realized this was leading to a lot of bad images because the prompt wasn’t detailed; it wasn’t a good prompt. So they built a system that recognizes if your prompt is too short, low detail, bad, and hands it to another model and says, “Improve this prompt,” and it improves the prompt for you. And if you open the developer tools in Chrome or Safari or Firefox, whatever, you can see the JSON being passed back and forth: You can see your original prompt going in, and then you can see the improved prompt. 

15.36: My favorite example of this [is] I asked it to make a statue of a man thinking, and it came back and said something like “A detailed statue of a human figure in a thinking pose similar to Rodin’s ‘The Thinker.’ The statue is made of weathered stone sitting on a pedestal. . .” Blah blah blah blah blah blah. A paragraph. . . But below that prompt there were instructions to the chatbot, to the LLM, that said, “Generate this image, and after you generate the image, do not reply. Do not ask follow-up questions. Do not make any comments describing what you’ve done. Just generate the image.” And in this prompt, nine times, some of them in all caps, they say, “Please do not reply.” And the reason is that a big chunk of OpenAI’s posttraining is teaching these models how to converse back and forth. They want the model to always be asking a follow-up question, and they train it to do that. And so now they have to fight that training with the prompt. They have to add in all these statements. And that’s another way that context fails. 
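This isn’t OpenAI’s actual implementation—we only glimpse it through the developer tools—but a minimal sketch of the pattern described here might look like the following. The word-count threshold, the rewriter model, and the rewrite instructions are all assumptions made for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_INSTRUCTIONS = (
    "Rewrite the user's image prompt with concrete visual detail: subject, "
    "material, lighting, composition. Reply with the improved prompt only. "
    "Do not ask follow-up questions. Do not add commentary."
)

def maybe_improve_prompt(user_prompt: str, min_words: int = 8) -> str:
    """Route short, low-detail prompts through a rewriter model first."""
    if len(user_prompt.split()) >= min_words:
        return user_prompt  # detailed enough; use as-is
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in rewriter model, not OpenAI's actual one
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTIONS},
            {"role": "user", "content": user_prompt},
        ],
    )
    return completion.choices[0].message.content.strip()

# The improved prompt would then go to the image model, along with the
# "generate the image and do not reply" style instructions described above.
print(maybe_improve_prompt("a statue of a man thinking"))
```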

16.42: So why I bring this up—and this is why I need to write about it—is that as an applied AI developer, you need to recognize when you’re fighting the weights, and understand enough about the posttraining of that model, or make some assumptions about it, so that you can stop doing that and try something different. Because otherwise you’re just banging your head against a wall, and you’re going to get inconsistent, bad applications and a prompt with the same statement 20 times over. 

17.07: By the way, the other thing that’s interesting about this whole topic is that people have somehow underappreciated or forgotten all of the progress we’ve made in information retrieval. There’s a whole. . . I mean, these people have their own conferences, right? Everything from reranking to the actual indexing, even with vector search—the information retrieval community still has a lot to offer, and it’s the kind of thing that people underappreciate. And so by simply loading your context window with massive amounts of garbage, you’re leaving so much of that progress in information retrieval on the field.
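As an illustration of the point about leaving retrieval progress on the field, here’s a deliberately tiny retrieve-then-rerank sketch. The lexical-overlap scorer is a stand-in for a real reranker (a cross-encoder or a hosted rerank API), and the corpus and query are invented.

```python
# Rather than dumping every retrieved chunk into the context window,
# score chunks against the query and keep only the top few.

def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query terms that appear in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)

def rerank(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Keep the top_k most relevant chunks; swap in a real reranker here."""
    return sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)[:top_k]

retrieved = [
    "Refund requests must be filed within 30 days of purchase.",
    "Our offices are closed on public holidays.",
    "Refunds are issued to the original payment method within 5 business days.",
    "The company picnic is scheduled for June.",
]

query = "How long do refunds take to be issued?"
context = "\n".join(rerank(query, retrieved))
print(context)  # only the refund-related chunks make it into the prompt
```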

18.04: I do think it’s hard. And that’s one of the risks: We’re building all this stuff so fast from the ground up, and there’s a tendency to just throw everything into the biggest model possible and then hope it sorts it out.

I really do think there are two pools of developers. There’s the “throw everything in the model” pool, and then there’s the “I’m going to take incremental steps and find the most optimal model” pool. And I often find that latter group—which I call the compound AI group, after a paper that was published out of Berkeley—tends to be people who have run data pipelines, because it’s not just a simple back-and-forth interaction. It’s gigabytes or even more of data you’re processing with the LLM. The costs are high. Latency is important. So designing efficient systems is incredibly key, if not a total requirement. And there’s a lot of innovation that comes out of that space because of that kind of boundary.

19.08: If you were to talk to one of these applied AI teams and you were to give them one or two things that they can do right away to improve, or fix context in general, what are some of the best practices?

19.29: Well you’re going to laugh, Ben, because the answer is dependent on the context—and I mean the context of the team and what have you. 

19.38: But if you were to just go give a keynote to a general audience, if you were to list down one, two, or three things that are the lowest hanging fruit, so to speak. . .

19.50: The first thing I’m gonna do is I’m going to look in the room and I’m going to look at the titles of all the people in there, and I’m going to see if they have any subject-matter experts or if it’s just a bunch of engineers trying to build something for subject-matter experts. And my first bit of advice is you need to get yourself a subject-matter expert who is looking at the data, helping you with the eval data, and telling you what “good” looks like. 

I see a lot of teams that don’t have this, and they end up building fairly brittle prompt systems. And then they can’t iterate well, and so that enterprise AI project fails. I also see them not wanting to open themselves up to subject-matter experts, because they want to hold on to the power themselves. It’s not how they’re used to building. 

20.38: I really do think building in applied AI has changed the power dynamic between builders and subject-matter experts. You know, we were talking earlier about the old Web 2.0 days, and I’m sure you remember. . . Remember back at the beginning of the iOS app craze, we’d be at a dinner party and someone would find out that you’re capable of building an app, and you would get cornered by some guy who’s like “I’ve got a great idea for an app,” and he would just talk at you—usually a he. 

21.15: This is back in the Objective-C days. . .

21.17: Yes, way back when. And this is coming from someone who loves Objective-C. So you’d get cornered and you’d try to find a way out of that awkward conversation. Nowadays, that dynamic has shifted. The subject-matter expertise is so important for codifying and designing the spec—which usually gets specced out by the evals—that the power now leans more their way. And you can even see this: OpenAI is arguably at the forefront of this stuff. And what are they doing? They’re standing up programs to get lawyers to come in, to get doctors to come in, to get these specialists to come in and help them create benchmarks, because they can’t do it themselves. So that’s the first thing. Got to work with the subject-matter expert. 

22.04: The second thing is, if they’re just starting out—and this is going to sound backwards, given our topic today—I would encourage them to use a system like DSPy or GEPA, which are essentially frameworks for building with AI. And one of the components of those frameworks is that they optimize the prompt for you, with the help of an LLM and your eval data. 

22.37: Throw in BAML?

22.39: BAML is similar [but it’s] more like the spec for how to describe the entire spec. So it’s similar.

22.52: BAML and TextGrad? 

22.55: TextGrad is more like the prompt optimization I’m talking about. 

22:57: TextGrad plus GEPA plus Regolo?

23.02: Yeah, those things are really important. And the reason I say they’re important is. . .

23.08: I mean, Drew, those are kind of advanced topics. 

23.12: I don’t think they’re that advanced. I think they can appear really intimidating because everybody comes in and says, “Well, it’s so easy. I could just write what I want.” And this is the gift and curse of prompts, in my opinion. There’s a lot to like about them.

23.33: DSPy is fine, but I think TextGrad, GEPA, and Regolo. . .

23.41: Well. . . I wouldn’t encourage you to use GEPA directly. I would encourage you to use it through the framework of DSPy. 

23.48: The point here is, if it’s a team building, you can go down essentially two paths. You can handwrite your prompt, and I think this creates some issues. One is, as you build, you tend to have a lot of hotfix statements like, “Oh, there’s a bug over here. We’ll say it over here. Oh, that didn’t fix it. So let’s say it again.” It also encourages you to have one person who really understands this prompt, and so you end up being reliant on this prompt magician. Even though prompts are written in English, there’s kind of no syntax highlighting. They get messier and messier as you build the application, because they grow and become these collections of edge cases.

24.27: And the other thing too, and this is really important, is when you build and you spend so much time honing a prompt, you’re doing it against one model, and then at some point there’s going to be a better, cheaper, more effective model. And you’re going to have to go through the process of tweaking it and fixing all the bugs again, because this model functions differently.

And I used to have to try to convince people that this was a problem, but they all kind of found out when OpenAI deprecated all of their models and tried to move everyone over to GPT-5. And now I hear about it all the time. 
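For readers who want to see what Drew’s suggestion—letting a framework optimize the prompt against your evals instead of handwriting it—looks like in code, here is a minimal sketch using DSPy (assuming a recent version of the library). The signature, metric, and tiny trainset are invented for illustration; in real use the eval set is built with your subject-matter experts and is much larger, and GEPA can be swapped in as the optimizer where available.

```python
import dspy

# Point DSPy at whichever model you're currently using. Switching models later
# means rerunning compile against your evals, not hand-rewriting the prompt.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class AnswerWithContext(dspy.Signature):
    """Answer the question using only the provided context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

program = dspy.ChainOfThought(AnswerWithContext)

# A tiny, invented eval set; real ones come from subject-matter experts.
trainset = [
    dspy.Example(
        context="Refunds are issued within 5 business days.",
        question="How long do refunds take?",
        answer="5 business days",
    ).with_inputs("context", "question"),
]

def exact_match(example, prediction, trace=None):
    return example.answer.lower() in prediction.answer.lower()

# MIPROv2 uses an LLM plus your metric and data to propose and select better
# instructions and demos; newer DSPy releases also expose GEPA as an optimizer.
optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
compiled = optimizer.compile(program, trainset=trainset)

print(compiled(context=trainset[0].context, question=trainset[0].question).answer)
```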

25.03: Although I think right now “agents” is our hot topic, right? So when you talk to people about agents and you start really getting into the weeds, you realize, “Oh, okay. So their agents are really just prompts.” 

25.16: In the loop. . .

25.19: So agent optimization in many ways means injecting a bit more software engineering rigor in how you maintain and version. . .

25.30: Because that context is growing. As that loop goes, you’re deciding what gets added to it. And so you have to put guardrails in—ways to rescue from failure and figure out all these things. It’s very difficult. And you have to go at it systematically. 

25.46: And then the problem is that, in many situations, the models are not even models that you control. You’re using them through an API like OpenAI’s or Claude’s, so you don’t actually have access to the weights. So even if you’re one of the super, super advanced teams that can do gradient descent and backprop, you can’t do that here. Right? So then, what are your options for being more rigorous in doing optimization?

Well, it’s precisely these tools that Drew alluded to—the TextGrads of the world, the GEPAs. You have these compound systems that are nondifferentiable. So how do you actually do optimization in a world where you have things that are not differentiable? These are precisely the tools that will allow you to turn it from somewhat of a, I guess, black art into something with a little more discipline. 
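To make the idea concrete—optimizing a system you can’t backpropagate through—here’s a stripped-down propose-and-select loop in the spirit of the GEPA/TextGrad family (this is not either library’s actual API). The eval set, the scoring rule, and the failure summary fed to the rewriter are all invented for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def run_system(prompt: str, question: str) -> str:
    """The 'forward pass' of the compound system: system prompt + question -> answer."""
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": prompt},
                  {"role": "user", "content": question}],
    )
    return out.choices[0].message.content

def score(prompt: str, evalset: list[tuple[str, str]]) -> float:
    """No gradients available, so the eval set is the only optimization signal."""
    hits = sum(expected.lower() in run_system(prompt, q).lower() for q, expected in evalset)
    return hits / len(evalset)

def propose_rewrite(prompt: str, failure_notes: str) -> str:
    """Ask an LLM to mutate the prompt, using failure notes as a 'textual gradient'."""
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": (
            "Improve this system prompt so it fixes the failures below. "
            "Reply with the new prompt only.\n\n"
            f"Prompt:\n{prompt}\n\nFailures:\n{failure_notes}"
        )}],
    )
    return out.choices[0].message.content

evalset = [("How long do refunds take?", "5 business days")]  # SME-built in practice
best = "You are a support assistant. Answer from company policy."
best_score = score(best, evalset)

for _ in range(3):  # a few rounds of propose-and-select
    candidate = propose_rewrite(best, "Answers are vague about specific timelines.")
    s = score(candidate, evalset)
    if s > best_score:  # keep the mutation only if the evals improve
        best, best_score = candidate, s
```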

26.53: And I think a good example is, even if you aren’t going to use prompt optimization-type tools. . . Prompt optimization is a great solution for what you just described—when you can’t control the weights of the models you’re using. But even if you aren’t going to adopt that, you need evals, because that’s going to be step one for anything: You need to start working with subject-matter experts to create evals.

27.22: Because what I see. . . And there was just a really dumb argument online of “Are evals worth it or not?” And it was really silly to me because it was positioned as an either-or argument. And there were people arguing against evals, which is just insane to me. And the reason they were arguing against evals is they’re basically arguing in favor of what they called, to your point about dark arts, vibe shipping—which is they’d make changes, push those changes, and then the person who was also making the changes would go in and type in 12 different things and say, “Yep, feels right to me.” And that’s insane to me. 

27.57: And even if you’re doing that—which I think is a good thing; you may not go create full eval coverage, but you have some taste. . . And I do think when you’re building more qualitative tools. . . A good example is if you’re Character.AI or you’re Portola Labs, who are building essentially personalized emotional chatbots, it’s going to be harder to create evals, and it’s going to require taste as you build them. But having evals is going to ensure that your whole thing didn’t fall apart because you changed one sentence—which, sadly, is a risk, because this is probabilistic software.
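Since evals keep coming up as the real asset here, a minimal sketch of the kind of harness being described might look like this: an SME-labeled file of cases that gets run against the system on every change, gating releases instead of vibe shipping. The file format, the pass criterion, and the `answer_question` stub are assumptions.

```python
import json

def answer_question(question: str) -> str:
    """Stand-in for your actual application (RAG pipeline, agent loop, etc.)."""
    raise NotImplementedError("wire this up to your system before running evals")

def run_evals(path: str = "evals.jsonl", threshold: float = 0.9) -> bool:
    """Each line: {"question": ..., "must_contain": ...}, written with your SMEs."""
    with open(path, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f]
    passed = 0
    for case in cases:
        output = answer_question(case["question"])
        if case["must_contain"].lower() in output.lower():
            passed += 1
        else:
            print(f"FAIL: {case['question']!r} -> {output[:80]!r}")
    rate = passed / len(cases)
    print(f"pass rate: {rate:.0%} ({passed}/{len(cases)})")
    return rate >= threshold  # gate the release on this whenever a prompt or model changes
```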

28.33: Honestly, evals are super important. Number one, because leaderboards like LMArena are great for narrowing your options, but at the end of the day, you still need to benchmark all of these models against your own application, use case, and domain. And then secondly, obviously, it’s an ongoing thing, so it ties in with reliability. The more reliable your application is, the more likely it is that you’re doing evals properly in an ongoing fashion. And I really believe that evals and reliability are a moat, because basically what else is your moat? A prompt? That’s not a moat. 

29.21: So first off, violent agreement there. The only asset teams truly have—unless they’re a model builder, which is only a handful—is their eval data. And I would say the counterpart to that is their spec, whatever defines their program, but mostly the eval data. But to the other point about it, like why are people vibe shipping? I think you can get pretty far with vibe shipping and it fools you into thinking that that’s right.

We saw this pattern in the Web 2.0 and social era, which was, you would have the product genius—everybody wanted to be the Steve Jobs, who didn’t hold focus groups, didn’t ask their customers what they wanted. The Henry Ford quote about “They all say faster horses,” and I’m the genius who comes in and tweaks these things and ships them. And that often takes you very far.

30.13: I also think it’s a bias of success. We only know about the ones that succeed. But the best ones, when they grow up and start to serve an audience that’s way bigger than what they could hold in their head, they start to grow up with A/B testing and ABX testing throughout their organization. And a good example of that is Facebook.

Facebook stopped being driven by just a few people’s product choices and started having to do testing—ABX testing—in every aspect of their business. Compare that to Snap, which, again, was kind of the last of the great product geniuses to come out. Evan [Spiegel] was heralded as “the product genius,” but I think they ran that too long, and they kept shipping on vibes rather than shipping on ABX testing and growing and, you know, being more boring.

31.04: But again, that’s how you get the global reach. I think there’s a lot of people who probably are really great vibe shippers. And they’re probably having great success doing that. The question is, as their company grows and starts to hit harder times or the growth starts to slow, can that vibe shipping take them over the hump? And I would argue, no, I think you have to grow up and start to have more accountable metrics that, you know, scale to the size of your audience. 

31.34: So in closing. . . We talked about prompt engineering. And then we talked about context engineering. So putting you on the spot. What’s a buzzword out there that either irks you or you think is undertalked about at this point? So what’s a buzzword out there, Drew? 

31.57: [laughs] I mean, I wish you had given me some time to think about it. 

31.58: We are in a hype cycle here. . .

32.02: We’re always in a hype cycle. I don’t like anthropomorphizing LLMs or AI, for a whole host of reasons. One, I think it leads to bad understanding and bad mental models, which means that we don’t have substantive conversations about these things, and we don’t learn how to build really well with them, because we think they’re intelligent. We think they’re a PhD in your pocket. We think they’re all of these things, and they’re not—they’re fundamentally different. 

I’m not against using the way we think the brain works for inspiration. That’s fine with me. But when you start oversimplifying these and not taking the time to explain to your audience how they actually work—you just say it’s a PhD in your pocket, and here’s the benchmark to prove it—you’re misleading and setting unrealistic expectations. And unfortunately, the market rewards them for that. So they keep going. 

But I also think it just doesn’t help you build sustainable programs, because you aren’t actually understanding how it works. You’re just kind of reducing it down. AGI is one of those terms. And superintelligence—but AGI especially.

33.21: I went to school at UC Santa Cruz, and one of my favorite classes I ever took was a seminar with Donna Haraway. Donna Haraway wrote “A Cyborg Manifesto” in the ’80s. She looks at tech and science history through a feminist lens. You would just sit in that class and your mind would explode, and then at the end, you’d have to sit there for five minutes afterwards, just picking up the pieces. 

She had a great term called “power objects.” A power object is something that we as a society recognize to be incredibly important, believe to be incredibly important, but we don’t know how it works. That lack of understanding allows us to fill this bucket with whatever we want it to be: our hopes, our fears, our dreams. This happened with DNA; this happened with PET scans and brain scans. This happens all throughout science history, down to phrenology and blood types—things that we understood to be, or believed to be, important, but they’re not. And big data is another one that is very, very relevant. 

34.34: That’s my handle on Twitter. 

34.55: Yeah, there you go. So like it’s, you know, I fill it with Ben Lorica. That’s how I fill that power object. But AI is definitely that. AI is definitely that. And my favorite example of this is when the DeepSeek moment happened, we understood this to be really important, but we didn’t understand why it works and how well it worked.

And so what happened is, if you looked at the news and you looked at people’s reactions to what DeepSeek meant, you could basically find all the hopes and dreams about whatever was important to that person. So to AI boosters, DeepSeek proved that LLM progress is not slowing down. To AI skeptics, DeepSeek proved that AI companies have no moat. To open source advocates, it proved open is superior. To AI doomers, it proved that we aren’t being careful enough. Security researchers worried about the risk of backdoors in the models because it was in China. Privacy advocates worried about DeepSeek’s web services collecting sensitive data. China hawks said, “We need more sanctions.” Doves said, “Sanctions don’t work.” NVIDIA bears said, “We’re not going to need any more data centers if it’s going to be this efficient.” And bulls said, “No, we’re going to need tons of them because it’s going to use everything.”

35.44: And AGI is another term like that, which means everything and nothing. And whenever the point comes that we’ve supposedly reached it, it isn’t. And compounding that is that it’s in the contract between OpenAI and Microsoft—I forget the exact term, but it’s the statement that Microsoft gets access to OpenAI’s technologies until AGI is achieved.

And so it’s a very loaded definition right now, being debated back and forth as they try to figure out how to take OpenAI into being a for-profit corporation. And Microsoft has a lot of leverage, because how do you define AGI? Are we going to go to court to define what AGI is? I almost look forward to that.

36.28: So it’s going to be that thing. You’ve seen Sam Altman come out, and some days he talks about how LLMs are just software. Some days he talks about how it’s a PhD in your pocket. Some days he talks about how we’ve already passed AGI, it’s already over. 

I think Nathan Lambert has some great writing about how AGI is a mistake. We shouldn’t talk about trying to turn LLMs into humans. We should try to leverage what they do now, which is something fundamentally different, and we should keep building and leaning into that rather than trying to make them like us. So AGI is my word for you. 

37.03: The way I think of it is, AGI is great for fundraising, let’s put it that way. 

37.08: That’s basically it. Well, until you need it to have already been achieved, or until you need it to not be achieved because you don’t want any regulation or if you want regulation—it’s kind of a fuzzy word. And that has some really good properties. 

37.23: So I’ll close by throwing in my own term. We talked about prompt engineering and context engineering. . . I will close by saying pay attention to this boring term that my friend Ion Stoica is now talking more about: “systems engineering.” If you look particularly at agentic applications, you’re talking about systems.

37.55: Can I add one thing to this? Violent agreement. I think that is an underrated. . . 

38.00: Although I think it’s too boring a term, Drew, to take off.

38.03: That’s fine! The reason I like it—and you were talking about this when you talked about fine-tuning—is that, looking at the way people build, and the way I see teams with success build, there’s pretraining, where you’re basically training on unstructured data and you’re just building your base knowledge, your base English capabilities, and all that. And then you have posttraining. And in general, posttraining is where you build. I do think of it as a form of interface design, even though you are adding new skills: You’re teaching it reasoning; you’re teaching it validated functions like code and math; you’re teaching it how to chat with you—this is where it learns to converse; you’re teaching it how to use tools and specific sets of tools; and then you’re teaching it alignment—what’s safe, what’s not safe, all these other things. 

But then after it ships, you can still RL that model, you can still fine-tune that model, you can still prompt engineer that model, and you can still context engineer that model. And back to the systems engineering thing: I think we’re going to see that continuum from posttraining all the way through to a final applied AI product. That’s going to be a real shades-of-gray gradient. And this is one of the reasons why I think open models have a pretty big advantage in the future: You’re going to dip down all the way throughout that stack and leverage it. . .

39.32: The only thing that’s keeping us from doing that now is we don’t have the tools and the operating system to align everything from posttraining all the way through to shipping. Once we do, that operating system is going to change how we build, because the distance between posttraining and building is going to look really, really, really blurry. I really like the systems engineering type of approach. And I think you could start to see this yesterday, [when] Thinking Machines released their first product.

40.04: And so Thinking Machines is Mira [Murati]’s company—her very hyped thing. They launched their first product, and it’s called Tinker. And it’s essentially, “Hey, you can write some very simple Python code, and then we will do the RL for you or the fine-tuning for you using our cluster of GPUs, so you don’t have to manage that.” And that is the type of thing that we want to see in a maturing development framework. And you start to see this operating system emerging. 

And it reminds me of the early days of O’Reilly, where I had to stand up a web server, I had to maintain a web server, I had to do all of these things, and now I don’t have to. I can spin up a Docker image, I can ship to Render, I can ship to Vercel. All of these shared complicated things now have frameworks and tooling, and I think we’re going to see a similar evolution here. And I’m really excited. And I think you have picked a great underrated term. 

40.56: Now with that. Thank you, Drew. 

40.58: Awesome. Thank you for having me, Ben.
