
Generative AI in the Real World: Context Engineering with Drew Breunig

16 October 2025 at 07:18

In this episode, Ben Lorica and Drew Breunig, a strategist at the Overture Maps Foundation, talk all things context engineering: what’s working, where things are breaking down, and what comes next. Listen in to hear why huge context windows aren’t solving the problems we hoped they might, why companies shouldn’t discount evals and testing, and why we’re doing the field a disservice by leaning into marketing and buzzwords rather than trying to leverage what the current crop of LLMs is actually capable of.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: All right. So today we have Drew Breunig. He is a strategist at the Overture Maps Foundation. And he’s also in the process of writing a book for O’Reilly called the Context Engineering Handbook. And with that, Drew, welcome to the podcast.

00.23: Thanks, Ben. Thanks for having me on here. 

00.26: So context engineering. . . I remember before ChatGPT was even released, someone was talking to me about prompt engineering. I said, “What’s that?” And then of course, fast-forward to today, now people are talking about context engineering. And I guess the short definition is it’s the delicate art and science of filling the context window with just the right information. What’s broken with how teams think about context today? 

00.56: I think it’s important to talk about why we need a new word or why a new word makes sense. I was just talking with Mike Taylor, who wrote the prompt engineering book for O’Reilly, exactly about this and why we need a new word. Why is prompt engineering not good enough? And I think it has to do with the way the models, and the way they’re being built, are evolving. I think it also has to do with the way that we’re learning how to use these models. 

And so prompt engineering was a natural word to think about when your interaction and how you program the model was maybe one turn of conversation, maybe two, and you might pull in some context to give it examples. You might do some RAG and context augmentation, but you’re working with this one-shot service. And that was really similar to the way people were working in chatbots. And so prompt engineering started to evolve as this thing. 

02.00: But as we started to build agents and as companies started to develop models that were capable of multiturn tool-augmented reasoning usage, suddenly you’re not using that one prompt. You have a context that is sometimes being prompted by you, sometimes being modified by your software harness around the model, sometimes being modified by the model itself. And increasingly the model is starting to manage that context. And that prompt is very user-centric. It is a user giving that prompt. 

But when we start to have this multiturn, systematic editing and preparation of contexts, a new word was needed, which is this idea of context engineering. This is not to belittle prompt engineering. I think it’s an evolution. And it shows how we’re evolving and finding this space in real time. I think context engineering is more suited to agents and applied AI programming, whereas prompt engineering lives in how people use chatbots, which is a different field. It’s not better and not worse. 

And so context engineering is more specific to understanding the failure modes that occur, diagnosing those failure modes and establishing good practices for both preparing your context but also setting up systems that fix and edit your context, if that makes sense. 

03.33: Yeah, and also, it seems like the words themselves are indicative of the scope, right? So “prompt” engineering means it’s the prompt. So you’re fiddling with the prompt. And [with] context engineering, “context” can be a lot of things. It could be the information you retrieve. It might involve RAG, so you retrieve information. You put that in the context window. 

04.02: Yeah. And people were doing that with prompts too. But I think in the beginning we just didn’t have the words. And that word became a big empty bucket that we filled up. You know, the quote I always quote too often, but I find it fitting, is one of my favorite quotes from Stewart Brand, which is, “If you want to know where the future is being made, follow where the lawyers are congregating and the language is being invented,” and the arrival of context engineering as a word came after the field was invented. It just kind of crystallized and demarcated what people were already doing. 

04.36: So the word “context” means you’re providing context. So context could be a tool, right? It could be memory. Whereas the word “prompt” is much more specific. 

04.55: And I think it also is like, it has to be edited by a person. I’m a big advocate for not using anthropomorphizing words around large language models. “Prompt” to me involves agency. And so I think it’s nice—it’s a good delineation. 

05.14: And then I think one of the very immediate lessons that people realize is, just because. . . 

So one of the things that these model providers note when they have a model release is, What’s the size of the context window? So people started associating the context window [with] “I stuff as much as I can in there.” But the reality is actually that, one, it’s not efficient. And two, it’s also not useful to the model. Just because you have a massive context window doesn’t mean that the model treats the entire context window evenly.

05.57: Yeah, it doesn’t treat it evenly. And it’s not a one-size-fits-all solution. So I don’t know if you remember last year, but that was the big dream, which was, “Hey, we’re doing all this work with RAG and augmenting our context. But wait a second, if we can make the context 1 million tokens, 2 million tokens, I don’t have to run RAG on all of my corporate documents. I can just fit it all in there, and I can constantly be asking this. And if we can do this, we essentially have solved all of the hard problems that we were worrying about last year.” And so that was the big hope. 

And you started to see an arms race of everybody trying to make bigger and bigger context windows to the point where, you know, Llama 4 had its spectacular flameout. It was rushed out the door. But the headline feature by far was “We will be releasing a 10 million token context window.” And the thing that everybody realized is. . .  Like, all right, we were really hopeful for that. And then as we started building with these context windows, we started to realize there were some big limitations around them.

07.01: Perhaps the thing that clicked for me was in Google’s Gemini 2.5 paper. Fantastic paper. And one of the reasons I love it is because they dedicate about four pages in the appendix to talking about the kind of methodology and harnesses they built so that they could teach Gemini to play Pokémon: how to connect it to the game, how to actually read out the state of the game, how to make choices about it, what tools they gave it, all of these other things.

And buried in there was a real “warts and all” case study, which are my favorite kind, where you talk about the hard things and especially cite the things you can’t overcome. And Gemini 2.5 had a million-token context window with, eventually, 2 million tokens coming. But in this Pokémon thing, they said, “Hey, we actually noticed something, which is once you get to about 200,000 tokens, things start to fall apart, and they fall apart for a host of reasons.” They start to hallucinate. One of the things that is really demonstrable is they start to rely more on the context knowledge than the weights knowledge. 

08.22: So inside every model there’s a knowledge base. There’s, you know, all of these other things that get kind of buried into the parameters. But when you reach a certain level of context, it starts to overload the model, and it starts to rely more on the examples in the context. And so this means that you are not taking advantage of the full strength or knowledge of the model. 

08.43: So that’s one way it can fail. We call this “context distraction,” though Kelly Hong at Chroma has written an incredible paper documenting this, which she calls “context rot,” which is a similar way [of] charting when these benchmarks start to fall apart.

Now the cool thing about this is that you can actually use this to your advantage. There’s another paper out of, I believe, the Harvard Interaction Lab, where they look at these inflection points for. . . 

09.13: Are you familiar with the term “in-context learning”? In-context learning is when you teach the model to do something it doesn’t know how to do by providing examples in your context. And those examples illustrate how it should perform. It’s not something that it’s seen before. It’s not in the weights. It’s a completely unique problem. 

Well, sometimes those in-context learning[s] are counter to what the model has learned in the weights. So they end up fighting each other, the weights and the context. And this paper documented that when you get over a certain context length, you can overwhelm the weights and you can force it to listen to your in-context examples.

09.57: And so all of this is just to try to illustrate the complexity of what’s going on here and how I think one of the traps that leads us to this place is that the gift and the curse of LLMs is that we prompt and build contexts that are in the English language or whatever language you speak. And so that leads us to believe that they’re going to react like other people or entities that read the English language.

And the fact of the matter is, they don’t—they’re reading it in a very specific way. And that specific way can vary from model to model. And so you have to systematically approach this to understand these nuances, which is where the context management field comes in. 

10.35: This is interesting because even before those papers came out, there were studies which showed the exact opposite problem, which is the following: You may have a RAG system that actually retrieves the right information, but then somehow the LLMs can still fail because, as you alluded to, they have weights, so they have prior beliefs. They saw something [on] the internet, and they will opine against the precise information you retrieved from the context. 

11.08: This is a really big problem. 

11.09: So this is true even if the context window’s small actually. 

11.13: Yeah, and Ben, you touched on something that’s really important. So in my original blog post, I document four ways that context fails. I talk about “context poisoning.” That’s when you hallucinate something in a long-running task and it stays in there, and so it’s continually confusing it. “Context distraction,” which is when you overwhelm that soft limit to the context window and then you start to perform poorly. “Context confusion”: This is when you put things that aren’t relevant to the task inside your context, and suddenly the model thinks that it has to pay attention to this stuff, and it leads it astray. And then the last thing is “context clash,” which is when there’s information in the context that’s at odds with the task that you are trying to perform. 

A good example of this is, say you’re asking the model to only reply in JSON, but you’re using MCP tools that are defined with XML. And so you’re creating this backwards thing. But I think there’s a fifth piece that I need to write about because it keeps coming up. And it’s exactly what you described.

12.23: Douwe [Kiela] over at Contextual AI refers to this as “context” or “prompt adherence.” But the term that keeps sticking in my mind is this idea of fighting the weights. There’s three situations you get yourself into when you’re interacting with an LLM. The first is when you’re working with the weights. You’re asking it a question that it knows how to answer. It’s seen many examples of that answer. It has it in its knowledge base. It comes back with the weights, and it can give you a phenomenal, detailed answer to that question. That’s what I call “working with the weights.” 

The second is what we referred to earlier, which is that in-context learning, which is you’re doing something that it doesn’t know about and you’re showing an example, and then it does it. And this is great. It’s wonderful. We do it all the time. 

But then there’s a third example, which is, you’re providing it examples, but those examples are at odds with some things that it learned, usually during posttraining, during the fine-tuning or RL stage. A really good example is output formats. 

13.34: Recently a friend of mine was updating his pipeline to try out a new model, Moonshot’s. A really great model, and a really great model for tool use. And so he just changed his model and hit run to see what happened. And it kept failing—his thing couldn’t even work. He’s like, “I don’t understand. This is supposed to be the best tool use model there is.” And he asked me to look at his code.

I looked at his code and he was extracting data using Markdown, essentially: “Put the final answer in an ASCII box and I’ll extract it that way.” And I said, “If you change this to XML, see what happens. Ask it to respond in XML, use XML as your formatting, and see what happens.” He did that. That one change passed every test. Like basically crushed it because it was working with the weights. He wasn’t fighting the weights. Everyone’s experienced this if you build with AI: the stubborn things it refuses to do, no matter how many times you ask it, including formatting. 
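As a rough sketch of the kind of fix described here (not the actual pipeline from the story), you can ask the model to put its final answer inside XML-style tags and extract it with a simple pattern; the tag name and helper below are hypothetical:

import re

# Ask for a format the model was posttrained on (XML-style tags) instead of
# a bespoke Markdown/ASCII-box convention, then extract the answer with a regex.
PROMPT_SUFFIX = "Return your final answer inside <answer>...</answer> tags and nothing else."

def extract_answer(response_text: str) -> str | None:
    """Pull the final answer out of an <answer> block, if present."""
    match = re.search(r"<answer>(.*?)</answer>", response_text, re.DOTALL)
    return match.group(1).strip() if match else None

# extract_answer("Here you go: <answer>42</answer>") -> "42"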

14.35: [Here’s] my favorite example of this though, Ben: So in ChatGPT’s web interface or their application interface, if you go there and you try to prompt an image, a lot of the images that people prompt—and I’ve talked to user researchers about this—are really boring prompts. They have a text box that can be anything, and they’ll say something like “a black cat” or “a statue of a man thinking.”

OpenAI realized this was leading to a lot of bad images because the prompt wasn’t detailed; it wasn’t a good prompt. So they built a system that recognizes if your prompt is too short, low detail, bad, and it hands it to another model and says, “Improve this prompt,” and it improves the prompt for you. And if you open the developer tools in Chrome or Safari or Firefox, whatever, you can see the JSON being passed back and forth, and you can see your original prompt going in. Then you can see the improved prompt. 

15.36: My favorite example of this [is] I asked it to make a statue of a man thinking, and it came back and said something like “A detailed statue of a human figure in a thinking pose similar to Rodin’s ‘The Thinker.’ The statue is made of weathered stone sitting on a pedestal. . .” Blah blah blah blah blah blah. A paragraph. . . But below that prompt there were instructions to the chatbot or to the LLM that said, “Generate this image and after you generate the image, do not reply. Do not ask follow up questions. Do not ask. Do not make any comments describing what you’ve done. Just generate the image.” And in this prompt, nine times, some of them in all caps, they say, “Please do not reply.” And the reason is because a big chunk of OpenAI’s posttraining is teaching these models how to converse back and forth. They want you to always be asking a follow-up question, and they train it to do that. And so now they have to fight the prompts. They have to add in all these statements. And that’s another way that fails. 

16.42: So why I bring this up—and this is why I need to write about it—is as an applied AI developer, you need to recognize when you’re fighting the prompt, understand enough about the posttraining of that model, or make some assumptions about it, so that you can stop doing that and try something different, because you’re just banging your head against a wall and you’re going to get inconsistent, bad applications and the same statement 20 times over. 

17.07: By the way, the other thing that’s interesting about this whole topic is, people actually somehow have underappreciated or forgotten all of the progress we’ve made in information retrieval. There’s a whole. . . I mean, these people have their own conferences, right? Everything from reranking to the actual indexing, even with vector search—the information retrieval community still has a lot to offer, and it’s the kind of thing that people underappreciate. And so by simply loading your context window with massive amounts of garbage, you’re actually leaving so much progress in information retrieval on the field.

18.04: I do think it’s hard. And that’s one of the risks: We’re building all this stuff so fast from the ground up, and there’s a tendency to just throw everything into the biggest model possible and then hope it sorts it out.

I really do think there’s two pools of developers. There’s the “throw everything in the model” pool, and then there’s the “I’m going to take incremental steps and find the most optimal model.” And I often find that latter group, which I called a compound AI group after a paper that was published out of Berkeley, those tend to be people who have run data pipelines, because it’s not just a simple back and forth interaction. It’s gigabytes or even more of data you’re processing with the LLM. The costs are high. Latency is important. So designing efficient systems is actually incredibly key, if not a total requirement. So there’s a lot of innovation that comes out of that space because of that kind of boundary.

19.08: If you were to talk to one of these applied AI teams and you were to give them one or two things that they can do right away to improve, or fix context in general, what are some of the best practices?

19.29: Well you’re going to laugh, Ben, because the answer is dependent on the context, and I mean the context in the team and what have you. 

19.38: But if you were to just go give a keynote to a general audience, if you were to list down one, two, or three things that are the lowest hanging fruit, so to speak. . .

19.50: The first thing I’m gonna do is I’m going to look in the room and I’m going to look at the titles of all the people in there, and I’m going to see if they have any subject-matter experts or if it’s just a bunch of engineers trying to build something for subject-matter experts. And my first bit of advice is you need to get yourself a subject-matter expert who is looking at the data, helping you with the eval data, and telling you what “good” looks like. 

I see a lot of teams that don’t have this, and they end up building fairly brittle prompt systems. And then they can’t iterate well, and so that enterprise AI project fails. I also see them not wanting to open themselves up to subject-matter experts, because they want to hold on to the power themselves. It’s not how they’re used to building. 

20.38: I really do think building in applied AI has changed the power dynamic between builders and subject-matter experts. You know, we were talking earlier about some of like the old Web 2.0 days and I’m sure you remember. . . Remember back at the beginning of the iOS app craze, we’d be at a dinner party and someone would find out that you’re capable of building an app, and you would get cornered by some guy who’s like “I’ve got a great idea for an app,” and he would just talk at you—usually a he. 

21.15: This is back in the Objective-C days. . .

21.17: Yes, way back when. And this is someone who loves Objective-C. So you’d get cornered and you’d try to find a way out of that awkward conversation. Nowadays, that dynamic has shifted. The subject-matter expertise is so important for codifying and designing the spec, which usually gets specced out by the evals, that the dynamic leans more toward the expert. And you can even see this. OpenAI is arguably at the forefront of creating this stuff. And what are they doing? They’re standing up programs to get lawyers to come in, to get doctors to come in, to get these specialists to come in and help them create benchmarks, because they can’t do it themselves. And so that’s the first thing: You’ve got to work with the subject-matter expert. 

22.04: The second thing is if they’re just starting out—and this is going to sound backwards, given our topic today—I would encourage them to use a system like DSPy or GEPA, which are essentially frameworks for building with AI. And one of the components of that framework is that they optimize the prompt for you with the help of an LLM and your eval data. 

22.37: Throw in BAML?

22.39: BAML is similar [but it’s] more like the spec for how to describe the entire spec. So it’s similar.

22.52: BAML and TextGrad? 

22.55: TextGrad is more like the prompt optimization I’m talking about. 

22:57: TextGrad plus GEPA plus Regolo?

23.02: Yeah, those things are really important. And the reason I say they’re important is. . .

23.08: I mean, Drew, those are kind of advanced topics. 

23.12: I don’t think they’re that advanced. I think they can appear really intimidating because everybody comes in and says, “Well, it’s so easy. I could just write what I want.” And this is the gift and curse of prompts, in my opinion. There are a lot of things to like about it.

23.33: DSPy is fine, but I think TextGrad, GEPA, and Regolo. . .

23.41: Well. . . I wouldn’t encourage you to use GEPA directly. I would encourage you to use it through the framework of DSPy. 

23.48: The point here is, if you’re a team building, you can go down essentially two paths. You can handwrite your prompt, and I think this creates some issues. One is, as you build, you tend to have a lot of hotfix statements like, “Oh, there’s a bug over here. We’ll say it over here. Oh, that didn’t fix it. So let’s say it again.” It will encourage you to have one person who really understands this prompt. And so you end up being reliant on this prompt magician. Even though prompts are written in English, there’s kind of no syntax highlighting. They get messier and messier as you build the application, because they become these growing collections of edge cases.

24.27: And the other thing too, and this is really important, is when you build and you spend so much time honing a prompt, you’re doing it against one model, and then at some point there’s going to be a better, cheaper, more effective model. And you’re going to have to go through the process of tweaking it and fixing all the bugs again, because this model functions differently.

And I used to have to try to convince people that this was a problem, but they all kind of found out when OpenAI deprecated all of their models and tried to move everyone over to GPT-5. And now I hear about it all the time. 
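To make the workflow being recommended here concrete, here is a minimal sketch using DSPy; optimizer names and arguments vary by release (MIPROv2 below, with GEPA available in newer versions), and the dataset and metric are placeholders you would build with your subject-matter experts:

import dspy

# Point DSPy at whatever model you're using today; swapping models later
# means rerunning the optimizer rather than hand-tuning the prompt again.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class AnswerQuestion(dspy.Signature):
    """Answer a domain question concisely."""
    question = dspy.InputField()
    answer = dspy.OutputField()

program = dspy.ChainOfThought(AnswerQuestion)

# Eval data from your subject-matter experts (placeholders here).
trainset = [
    dspy.Example(question="...", answer="...").with_inputs("question"),
]

def metric(example, prediction, trace=None):
    # Replace with a real check your SMEs agree defines "good."
    return example.answer.lower() in prediction.answer.lower()

# The optimizer rewrites the prompt/instructions using an LLM plus your evals.
optimizer = dspy.MIPROv2(metric=metric, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)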

25.03: Although I think right now “agents” is our hot topic, right? So we talk to people about agents and you start really getting into the weeds, you realize, “Oh, okay. So their agents are really just prompts.” 

25.16: In the loop. . .

25.19: So agent optimization in many ways means injecting a bit more software engineering rigor in how you maintain and version. . .

25.30: Because that context is growing. As that loop goes, you’re deciding what gets added to it. And so you have to put guardrails in—ways to rescue from failure and figure out all these things. It’s very difficult. And you have to go at it systematically. 

25.46: And then the problem is that, in many situations, the models are not even models that you control, actually. You’re using them through an API like OpenAI or Claude so you don’t actually have access to the weights. So even if you’re one of the super, super advanced teams that can do gradient descent and backprop, you can’t do that. Right? So then, what are your options for being more rigorous in doing optimization?

Well, it’s precisely these tools that Drew alluded to, which is the TextGrads of the world, the GEPA. You have these compound systems that are nondifferentiable. So then how do you actually do optimization in a world where you have things that are not differentiable? Right. So these are precisely the tools that will allow you to turn it from somewhat of a, I guess, black art to something with a little more discipline. 

26.53: And I think a good example is, even if you aren’t going to use prompt optimization-type tools. . . The prompt optimization is a great solution for what you just described, which is when you can’t control the weights of the models you’re using. But the other thing too, is, even if you aren’t going to adopt that, you need to get evals because that’s going to be step one for anything, which is you need to start working with subject-matter experts to create evals.

27.22: Because what I see. . . And there was just a really dumb argument online of “Are evals worth it or not?” And it was really silly to me because it was positioned as an either-or argument. And there were people arguing against evals, which is just insane to me. And the reason they were arguing against evals is they’re basically arguing in favor of what they called, to your point about dark arts, vibe shipping—which is they’d make changes, push those changes, and then the person who was also making the changes would go in and type in 12 different things and say, “Yep, feels right to me.” And that’s insane to me. 

27.57: And even if you’re doing that—which I think is a good thing and you may not go create coverage and eval, you have some taste. . . And I do think when you’re building more qualitative tools. . . So a good example is like if you’re Character.AI or you’re Portola Labs, who’s building essentially personalized emotional chatbots, it’s going to be harder to create evals and it’s going to require taste as you build them. But having evals is going to ensure that your whole thing didn’t fall apart because you changed one sentence, which sadly is a risk because these are probabilistic software.

28.33: Honestly, evals are super important. Number one, because, basically, leaderboards like LMArena are great for narrowing your options. But at the end of the day, you still need to benchmark all of these against your own application use case and domain. And then secondly, obviously, it’s an ongoing thing. So it ties in with reliability. The more reliable your application is, that means most likely you’re doing evals properly in an ongoing fashion. And I really believe that eval and reliability are a moat, because basically what else is your moat? Prompt? That’s not a moat. 

29.21: So first off, violent agreement there. The only asset teams truly have—unless they’re a model builder, which is only a handful—is their eval data. And I would say the counterpart to that is their spec, whatever defines their program, but mostly the eval data. But to the other point about it, like why are people vibe shipping? I think you can get pretty far with vibe shipping and it fools you into thinking that that’s right.

We saw this pattern in the Web 2.0 and social era, which was, you would have the product genius—everybody wanted to be the Steve Jobs, who didn’t hold focus groups, didn’t ask their customers what they wanted. The Henry Ford quote about “They all say faster horses,” and I’m the genius who comes in and tweaks these things and ships them. And that often takes you very far.

30.13: I also think it’s a bias of success. We only know about the ones that succeed, but the best ones, when they grow up and they start to serve an audience that’s way bigger than what they could hold in their head, they start to grow up with AB testing and ABX testing throughout their organization. And a good example of that is Facebook.

Facebook stopped being just some choices and started having to do testing and ABX testing in every aspect of their business. Compare that to Snap, which again, was kind of the last of the great product geniuses to come out. Evan [Spiegel] was heralded as “He’s the product genius,” but I think they ran that too long, and they kept shipping on vibes rather than shipping on ABX testing and growing and, you know, being more boring.

31.04: But again, that’s how you get the global reach. I think there’s a lot of people who probably are really great vibe shippers. And they’re probably having great success doing that. The question is, as their company grows and starts to hit harder times or the growth starts to slow, can that vibe shipping take them over the hump? And I would argue, no, I think you have to grow up and start to have more accountable metrics that, you know, scale to the size of your audience. 

31.34: So in closing. . . We talked about prompt engineering. And then we talked about context engineering. So putting you on the spot. What’s a buzzword out there that either irks you or you think is undertalked about at this point? So what’s a buzzword out there, Drew? 

31.57: [laughs] I mean, I wish you had given me some time to think about it. 

31.58: We are in a hype cycle here. . .

32.02: We’re always in a hype cycle. I don’t like anthropomorphizing LLMs or AI, for a whole host of reasons. One, I think it leads to bad understanding and bad mental models, which means that we don’t have substantive conversations about these things, and we don’t learn how to build really well with them, because we think they’re intelligent. We think they’re a PhD in your pocket. We think they’re all of these things and they’re not—they’re fundamentally different. 

I’m not against using the way we think the brain works for inspiration. That’s fine with me. But when you start oversimplifying these and not taking the time to explain to your audience how they actually work—you just say it’s a PhD in your pocket, and here’s the benchmark to prove it—you’re misleading and setting unrealistic expectations. And unfortunately, the market rewards them for that. So they keep going. 

But I also think it just doesn’t help you build sustainable programs, because you aren’t actually understanding how it works. You’re just kind of reducing it down. AGI is one of those terms. And superintelligence, but AGI especially.

33.21: I went to school at UC Santa Cruz, and one of my favorite classes I ever took was a seminar with Donna Haraway. Donna Haraway wrote “A Cyborg Manifesto” in the ’80s. She looks at tech and science history through kind of a feminist lens. You would just sit in that class and your mind would explode, and then at the end, you’d just have to sit there for like five minutes afterwards, picking up the pieces. 

She had a great term called “power objects.” A power object is something that we as a society recognize to be incredibly important, believe to be incredibly important, but we don’t know how it works. That lack of understanding allows us to fill this bucket with whatever we want it to be: our hopes, our fears, our dreams. This happened with DNA; this happened with PET scans and brain scans. This happens all throughout science history, down to phrenology and blood types and things that we understand to be, or we believed to be, important, but they’re not. And big data, another one that is very, very relevant. 

34.34: That’s my handle on Twitter. 

34.55: Yeah, there you go. So like it’s, you know, I fill it with Ben Lorica. That’s how I fill that power object. But AI is definitely that. AI is definitely that. And my favorite example of this is when the DeepSeek moment happened, we understood this to be really important, but we didn’t understand why it works and how well it worked.

And so what happened is, if you looked at the news and you looked at people’s reactions to what DeepSeek meant, you could basically find all the hopes and dreams about whatever was important to that person. So to AI boosters, DeepSeek proved that LLM progress is not slowing down. To AI skeptics, DeepSeek proved that AI companies have no moat. To open source advocates, it proved open is superior. To AI doomers, it proved that we aren’t being careful enough. Security researchers worried about the risk of backdoors in the models because it was in China. Privacy advocates worried about DeepSeek’s web services collecting sensitive data. China hawks said, “We need more sanctions.” Doves said, “Sanctions don’t work.” NVIDIA bears said, “We’re not going to need any more data centers if it’s going to be this efficient.” And bulls said, “No, we’re going to need tons of them because it’s going to use everything.”

35.44: And AGI is another term like that, which means everything and nothing. And when the point comes that we’ve supposedly reached it, it isn’t. And compounding that is that it’s in the contract between OpenAI and Microsoft—I forget the exact terms, but it’s the statement that Microsoft gets access to OpenAI’s technologies until AGI is achieved.

And so it’s a very loaded definition right now that’s being debated back and forth as they try to figure out how to take [Open]AI into being a for-profit corporation. And Microsoft has a lot of leverage, because how do you define AGI? Are we going to go to court to define what AGI is? I almost look forward to that.

36.28: So it’s going to be that kind of thing. You’ve seen Sam Altman come out, and some days he talks about how LLMs are just software. Some days he talks about how it’s a PhD in your pocket; some days he talks about how we’ve already passed AGI, it’s already over. 

I think Nathan Lambert has some great writing about how AGI is a mistake. We shouldn’t talk about trying to turn LLMs into humans. We should try to leverage what they do now, which is something fundamentally different, and we should keep building and leaning into that rather than trying to make them like us. So AGI is my word for you. 

37.03: The way I think of it is, AGI is great for fundraising, let’s put it that way. 

37.08: That’s basically it. Well, until you need it to have already been achieved, or until you need it to not be achieved because you don’t want any regulation or if you want regulation—it’s kind of a fuzzy word. And that has some really good properties. 

37.23: So I’ll close by throwing in my own term. So prompt engineering, context engineering. . . I will close by saying pay attention to this boring term that my friend Ion Stoica is now talking more about: “systems engineering.” If you look at agentic applications in particular, you’re talking about systems.

37.55: Can I add one thing to this? Violent agreement. I think that is an underrated. . . 

38.00: Although I think it’s too boring a term, Drew, to take off.

38.03: That’s fine! The reason I like it—and you were talking about this when you talk about fine-tuning—is, looking at the way people build and the way I see successful teams build, there’s pretraining, where you’re basically training on unstructured data and you’re just building your base knowledge, your base English capabilities and all that. And then you have posttraining. And in general, posttraining is where you build. I do think of it as a form of interface design, even though you are adding new skills: you’re teaching reasoning, you’re teaching it validated functions like code and math. You’re teaching it how to chat with you. This is where it learns to converse. You’re teaching it how to use tools and specific sets of tools. And then you’re teaching it alignment, what’s safe, what’s not safe, all these other things. 

But then after it ships, you can still RL that model, you can still fine-tune that model, you can still prompt engineer that model, and you can still context engineer that model. And back to the systems engineering thing: I think we’re going to see that posttraining carry all the way through to a final applied AI product. That’s going to be a real shades-of-gray gradient. And this is one of the reasons why I think open models have a pretty big advantage in the future: you’re going to dip down all the way throughout that, leverage that. . .

39.32: The only thing that’s keeping us from doing that now is we don’t have the tools and the operating system to align everything from posttraining through to shipping. Once we do, that operating system is going to change how we build, because the line between posttraining and building is going to look really, really, really blurry. I really like the systems engineering type of approach, but I think you could also start to see this yesterday, [when] Thinking Machines released their first product.

40.04: And so Thinking Machines is Mira [Murati]. Her very hyped thing. They launched their first thing, and it’s called Tinker. And it’s essentially, “Hey, you can write some very simple Python code, and then we will do the RL for you or the fine-tuning for you using our cluster of GPUs so you don’t have to manage that.” And that is the type of thing that we want to see in a maturing kind of development framework. And you start to see this operating system emerging. 

And it reminds me of the early days of O’Reilly, where it’s like I had to stand up a web server, I had to maintain a web server, I had to do all of these things, and now I don’t have to. I can spin up a Docker image, I can ship to Render, I can ship to Vercel. All of these shared complicated things now have frameworks and tooling, and I think we’re going to see a similar evolution here. And I’m really excited. And I think you have picked a great underrated term. 

40.56: Now with that. Thank you, Drew. 

40.58: Awesome. Thank you for having me, Ben.


Working with Contexts

28 August 2025 at 06:02

The following article comes from two blog posts by Drew Breunig: “How Long Contexts Fail” and “How to Fix Your Contexts.”

Managing Your Context Is the Key to Successful Agents

As frontier model context windows continue to grow,1 with many supporting up to 1 million tokens, I see many excited discussions about how long-context windows will unlock the agents of our dreams. After all, with a large enough window, you can simply throw everything into a prompt you might need—tools, documents, instructions, and more—and let the model take care of the rest.

Long contexts kneecapped RAG enthusiasm (no need to find the best doc when you can fit it all in the prompt!), enabled MCP hype (connect to every tool and models can do any job!), and fueled enthusiasm for agents.2

But in reality, longer contexts do not generate better responses. Overloading your context can cause your agents and applications to fail in surprising ways. Contexts can become poisoned, distracting, confusing, or conflicting. This is especially problematic for agents, which rely on context to gather information, synthesize findings, and coordinate actions.

Let’s run through the ways contexts can get out of hand, then review methods to mitigate or entirely avoid context fails.

Context Poisoning

Context poisoning is when a hallucination or other error makes it into the context, where it is repeatedly referenced.

The DeepMind team called out context poisoning in the Gemini 2.5 technical report, which we broke down previously. When playing Pokémon, the Gemini agent would occasionally hallucinate, poisoning its context:

An especially egregious form of this issue can take place with “context poisoning”—where many parts of the context (goals, summary) are “poisoned” with misinformation about the game state, which can often take a very long time to undo. As a result, the model can become fixated on achieving impossible or irrelevant goals.

If the “goals” section of its context was poisoned, the agent would develop nonsensical strategies and repeat behaviors in pursuit of a goal that cannot be met.
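One way to reduce the odds of this, sketched below under the assumption that your agent keeps a structured state with a dedicated goals section: validate any model-proposed goal before committing it to long-lived context. The state shape and the allowed-goal list here are hypothetical.

# Hypothetical guard: only let a model-proposed goal into persistent context
# if it passes a cheap validation step, so a hallucinated goal can't linger.
VALID_GOALS = {"reach_next_gym", "heal_party", "catch_target_pokemon"}

def update_goals(agent_state: dict, proposed_goal: str) -> None:
    if proposed_goal in VALID_GOALS:
        agent_state.setdefault("goals", []).append(proposed_goal)
    else:
        # Keep it out of the context; log it for review instead.
        agent_state.setdefault("rejected_goals", []).append(proposed_goal)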

Context Distraction

Context distraction is when a context grows so long that the model over-focuses on the context, neglecting what it learned during training.

As context grows during an agentic workflow—as the model gathers more information and builds up history—this accumulated context can become distracting rather than helpful. The Pokémon-playing Gemini agent demonstrated this problem clearly:

While Gemini 2.5 Pro supports 1M+ token context, making effective use of it for agents presents a new research frontier. In this agentic setup, it was observed that as the context grew significantly beyond 100k tokens, the agent showed a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans. This phenomenon, albeit anecdotal, highlights an important distinction between long-context for retrieval and long-context for multistep, generative reasoning.

Instead of using its training to develop new strategies, the agent became fixated on repeating past actions from its extensive context history.

For smaller models, the distraction ceiling is much lower. A Databricks study found that model correctness began to fall around 32k for Llama 3.1-405b and earlier for smaller models.

If models start to misbehave long before their context windows are filled, what’s the point of super large context windows? In a nutshell: summarization3 and fact retrieval. If you’re not doing either of those, be wary of your chosen model’s distraction ceiling.
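A minimal way to respect that ceiling, assuming OpenAI’s tiktoken tokenizer as a rough proxy for your model’s own tokenizer: count the context’s tokens before each call and trigger summarization (covered later) once you cross a threshold you’ve chosen for your model.

import tiktoken

# Pick a ceiling well below the advertised window; ~100k for a large model,
# much lower for small ones (per the Databricks findings above).
DISTRACTION_CEILING = 100_000

enc = tiktoken.get_encoding("cl100k_base")

def needs_summarization(context_text: str) -> bool:
    """Return True when the assembled context has crossed the ceiling."""
    return len(enc.encode(context_text)) > DISTRACTION_CEILING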

Context Confusion

Context confusion is when superfluous content in the context is used by the model to generate a low-quality response.

For a minute there, it really seemed like everyone was going to ship an MCP. The dream of a powerful model, connected to all your services and stuff, doing all your mundane tasks felt within reach. Just throw all the tool descriptions into the prompt and hit go. Claude’s system prompt showed us the way, as it’s mostly tool definitions or instructions for using tools.

But even if consolidation and competition don’t slow MCPs, context confusion will. It turns out there can be such a thing as too many tools.

The Berkeley Function-Calling Leaderboard is a tool-use benchmark that evaluates the ability of models to effectively use tools to respond to prompts. Now on its third version, the leaderboard shows that every model performs worse when provided with more than one tool.4 Further, the Berkeley team “designed scenarios where none of the provided functions are relevant…we expect the model’s output to be no function call.” Yet all models will occasionally call tools that aren’t relevant.

Browsing the function-calling leaderboard, you can see the problem get worse as the models get smaller:

Tool-calling irrelevance score for Gemma models (chart from dbreunig.com, source: Berkeley Function-Calling Leaderboard; created with Datawrapper)

A striking example of context confusion can be seen in a recent paper that evaluated small model performance on the GeoEngine benchmark, a trial that features 46 different tools. When the team gave a quantized (compressed) Llama 3.1 8b a query with all 46 tools, it failed, even though the context was well within the 16k context window. But when they only gave the model 19 tools, it succeeded.

The problem is, if you put something in the context, the model has to pay attention to it. It may be irrelevant information or needless tool definitions, but the model will take it into account. Large models, especially reasoning models, are getting better at ignoring or discarding superfluous context, but we continually see worthless information trip up agents. Longer contexts let us stuff in more info, but this ability comes with downsides.

Context Clash

Context clash is when you accrue new information and tools in your context that conflicts with other information in the context.

This is a more problematic version of context confusion. The bad context here isn’t irrelevant, it directly conflicts with other information in the prompt.

A Microsoft and Salesforce team documented this brilliantly in a recent paper. The team took prompts from multiple benchmarks and “sharded” their information across multiple prompts. Think of it this way: Sometimes, you might sit down and type paragraphs into ChatGPT or Claude before you hit enter, considering every necessary detail. Other times, you might start with a simple prompt, then add further details when the chatbot’s answer isn’t satisfactory. The Microsoft/Salesforce team modified benchmark prompts to look like these multistep exchanges:

Microsoft/Salesforce team benchmark prompts

All the information from the prompt on the left side is contained within the several messages on the right side, which would be played out in multiple chat rounds.

The sharded prompts yielded dramatically worse results, with an average drop of 39%. And the team tested a range of models—OpenAI’s vaunted o3’s score dropped from 98.1 to 64.1.

What’s going on? Why are models performing worse if information is gathered in stages rather than all at once?

The answer is context clash: The assembled context, containing the entirety of the chat exchange, contains early attempts by the model to answer the challenge before it has all the information. These incorrect answers remain present in the context and influence the model when it generates its final answer. The team writes:

We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.

This does not bode well for agent builders. Agents assemble context from documents, tool calls, and other models tasked with subproblems. All of this context, pulled from diverse sources, has the potential to disagree with itself. Further, when you connect to MCP tools you didn’t create, there’s a greater chance their descriptions and instructions clash with the rest of your prompt.

Learnings

The arrival of million-token context windows felt transformative. The ability to throw everything an agent might need into the prompt inspired visions of superintelligent assistants that could access any document, connect to every tool, and maintain perfect memory.

But, as we’ve seen, bigger contexts create new failure modes. Context poisoning embeds errors that compound over time. Context distraction causes agents to lean heavily on their context and repeat past actions rather than push forward. Context confusion leads to irrelevant tool or document usage. Context clash creates internal contradictions that derail reasoning.

These failures hit agents hardest because agents operate in exactly the scenarios where contexts balloon: gathering information from multiple sources, making sequential tool calls, engaging in multi-turn reasoning, and accumulating extensive histories.

Fortunately, there are solutions!

Mitigating and Avoiding Context Failures

Let’s run through the ways we can mitigate or avoid context failures entirely.

Everything is about information management. Everything in the context influences the response. We’re back to the old programming adage of “garbage in, garbage out.” Thankfully, there’s plenty of options for dealing with the issues above.

RAG

Retrieval-augmented generation (RAG) is the act of selectively adding relevant information to help the LLM generate a better response.

Because so much has been written about RAG, we’re not going to cover it here beyond saying: It’s very much alive.

Every time a model ups the context window ante, a new “RAG is dead” debate is born. The last significant event was when Llama 4 Scout landed with a 10 million token window. At that size, it’s really tempting to think, “Screw it, throw it all in,” and call it a day.

But, as we’ve already covered, if you treat your context like a junk drawer, the junk will influence your response. If you want to learn more, here’s a new course that looks great.

Tool Loadout

Tool loadout is the act of selecting only relevant tool definitions to add to your context.

The term “loadout” is a gaming term that refers to the specific combination of abilities, weapons, and equipment you select before a level, match, or round. Usually, your loadout is tailored to the context—the character, the level, the rest of your team’s makeup, and your own skill set. Here, we’re borrowing the term to describe selecting the most relevant tools for a given task.

Perhaps the simplest way to select tools is to apply RAG to your tool descriptions. This is exactly what Tiantian Gan and Qiyao Sun did, which they detail in their paper “RAG MCP.” By storing their tool descriptions in a vector database, they’re able to select the most relevant tools given an input prompt.

When prompting DeepSeek-v3, the team found that selecting the right tools becomes critical when you have more than 30 tools. Above 30, the descriptions of the tools begin to overlap, creating confusion. Beyond 100 tools, the model was virtually guaranteed to fail their test. Using RAG techniques to select fewer than 30 tools yielded dramatically shorter prompts and resulted in as much as 3x better tool selection accuracy.
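A minimal sketch of the same idea, assuming the sentence-transformers library for embeddings (a production system would use a real vector database); the tool catalog and query are placeholders:

import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder tool catalog: name -> description.
TOOLS = {
    "search_flights": "Search for flights between two airports on a date.",
    "book_hotel": "Reserve a hotel room in a given city.",
    "get_weather": "Get the current weather for a location.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
tool_names = list(TOOLS)
tool_vecs = model.encode([TOOLS[n] for n in tool_names], normalize_embeddings=True)

def select_tools(query: str, k: int = 2) -> list[str]:
    """Return the k tool names whose descriptions best match the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = tool_vecs @ q  # cosine similarity, since vectors are normalized
    return [tool_names[i] for i in np.argsort(scores)[::-1][:k]]

# Only the selected tools' definitions go into the context for this turn.
print(select_tools("Find me a flight from SFO to JFK next Tuesday"))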

For smaller models, the problems begin long before we hit 30 tools. One paper we touched on previously, “Less is More,” demonstrated that Llama 3.1 8b fails a benchmark when given 46 tools, but succeeds when given only 19 tools. The issue is context confusion, not context window limitations.

To address this issue, the team behind “Less is More” developed a way to dynamically select tools using an LLM-powered tool recommender. The LLM was prompted to reason about “number and type of tools it ‘believes’ it requires to answer the user’s query.” This output was then semantically searched (tool RAG, again) to determine the final loadout. They tested this method with the Berkeley Function-Calling Leaderboard, finding Llama 3.1 8b performance improved by 44%.

The “Less is More” paper notes two other benefits to smaller contexts—reduced power consumption and speed—crucial metrics when operating at the edge (meaning, running an LLM on your phone or PC, not on a specialized server). Even when their dynamic tool selection method failed to improve a model’s result, the power savings and speed gains were worth the effort, yielding savings of 18% and 77%, respectively.

Thankfully, most agents have smaller surface areas that only require a few hand-curated tools. But if the breadth of functions or the amount of integrations needs to expand, always consider your loadout.

Context Quarantine

Context quarantine is the act of isolating contexts in their own dedicated threads, each used separately by one or more LLMs.

We see better results when our contexts aren’t too long and don’t sport irrelevant content. One way to achieve this is to break our tasks up into smaller, isolated jobs—each with its own context.

There are many examples of this tactic, but an accessible write-up of this strategy is Anthropic’s blog post detailing its multi-agent research system. They write:

The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. Each subagent also provides separation of concerns—distinct tools, prompts, and exploration trajectories—which reduces path dependency and enables thorough, independent investigations.

Research lends itself to this design pattern. When given a question, multiple agents can identify and separately prompt several subquestions or areas of exploration. This not only speeds up the information gathering and distillation (if there’s compute available), but it keeps each context from accruing too much information or information not relevant to a given prompt, delivering higher quality results:

Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single-agent system failed to find the answer with slow, sequential searches.

This approach also helps with tool loadouts, as the agent designer can create several agent archetypes with their own dedicated loadout and instructions for how to utilize each tool.
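A bare-bones sketch of the pattern, with a hypothetical call_llm helper standing in for your model client: each subquestion runs in its own isolated message list, and only the condensed findings flow back into the lead agent’s context.

from concurrent.futures import ThreadPoolExecutor

def call_llm(messages: list[dict]) -> str:
    """Hypothetical helper wrapping your model API of choice."""
    raise NotImplementedError

def run_subagent(subquestion: str) -> str:
    # Each subagent gets its own fresh context; nothing leaks between threads.
    messages = [
        {"role": "system", "content": "Research the question and reply with a short summary of findings."},
        {"role": "user", "content": subquestion},
    ]
    return call_llm(messages)

def research(question: str, subquestions: list[str]) -> str:
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(run_subagent, subquestions))
    # Only the condensed findings enter the lead agent's context.
    lead_messages = [
        {"role": "system", "content": "Synthesize the findings into one answer."},
        {"role": "user", "content": f"Question: {question}\n\nFindings:\n" + "\n".join(findings)},
    ]
    return call_llm(lead_messages)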

The challenge for agent builders, then, is to find opportunities for isolated tasks to spin out onto separate threads. Problems that require context-sharing among multiple agents aren’t particularly suited to this tactic.

If your agent’s domain is at all suited to parallelization, be sure to read the whole Anthropic write-up. It’s excellent.

Context Pruning

Context pruning is the act of removing irrelevant or otherwise unneeded information from the context.

Agents accrue context as they fire off tools and assemble documents. At times, it’s worth pausing to assess what’s been assembled and remove the cruft. This could be something you task your main LLM with or you could design a separate LLM-powered tool to review and edit the context. Or you could choose something more tailored to the pruning task.

Context pruning has a (relatively) long history, as context lengths were a more problematic bottleneck in the natural language processing (NLP) field prior to ChatGPT. Building on this history, a current pruning method is Provence, “an efficient and robust context pruner for question answering.”

Provence is fast, accurate, simple to use, and relatively small—only 1.75 GB. You can call it in a few lines, like so:

from transformers import AutoModel

provence = AutoModel.from_pretrained("naver/provence-reranker-debertav3-v1", trust_remote_code=True)

# Read in a markdown version of the Wikipedia entry for Alameda, CA
with open('alameda_wiki.md', 'r', encoding='utf-8') as f:
    alameda_wiki = f.read()

# Prune the article, given a question
question = 'What are my options for leaving Alameda?'
provence_output = provence.process(question, alameda_wiki)

Provence edited the article, cutting 95% of the content, leaving me with only this relevant subset. It nailed it.

One could employ Provence or a similar function to cull documents or the entire context. Further, this pattern is a strong argument for maintaining a structured5 version of your context in a dictionary or other form, from which you assemble a compiled string prior to every LLM call. This structure would come in handy when pruning, allowing you to ensure the main instructions and goals are preserved while the document or history sections can be pruned or summarized.
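Here is a sketch of that structured-context idea: keep the context as a dictionary of named sections, prune only the prunable ones, and compile the string right before each LLM call. The section names below are hypothetical.

# Keep the context structured; compile it to a string only at call time.
context = {
    "instructions": "You are a research assistant. Answer using the documents provided.",
    "goals": ["Summarize transit options for Alameda, CA"],
    "documents": [],   # prunable: filled by retrieval or tool calls
    "history": [],     # prunable: prior turns and tool outputs
}

def compile_context(ctx: dict) -> str:
    parts = [
        ctx["instructions"],
        "Goals:\n" + "\n".join(f"- {g}" for g in ctx["goals"]),
        "Documents:\n" + "\n\n".join(ctx["documents"]),
        "History:\n" + "\n".join(ctx["history"]),
    ]
    return "\n\n".join(parts)

def prune(ctx: dict, keep_docs: list[str]) -> None:
    # Instructions and goals are always preserved; only documents/history shrink.
    ctx["documents"] = keep_docs
    ctx["history"] = ctx["history"][-5:]  # keep only the most recent turns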

Context Summarization

Context summarization is the act of boiling down an accrued context into a condensed summary.

Context summarization first appeared as a tool for dealing with smaller context windows. As your chat session came close to exceeding the maximum context length, a summary would be generated and a new thread would begin. Chatbot users did this manually in ChatGPT or Claude, asking the bot to generate a short recap that would then be pasted into a new session.

However, as context windows increased, agent builders discovered there are benefits to summarization besides staying within the total context limit. As we’ve seen, beyond 100,000 tokens the context becomes distracting and causes the agent to rely on its accumulated history rather than training. Summarization can help it “start over” and avoid repeating context-based actions.

Summarizing your context is easy to do, but hard to perfect for any given agent. Knowing what information should be preserved and detailing that to an LLM-powered compression step is critical for agent builders. It’s worth breaking out this function as its own LLM-powered stage or app, which allows you to collect evaluation data that can inform and optimize this task directly.
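A sketch of breaking summarization out as its own stage, again with a hypothetical call_llm helper; the prompt spells out what must be preserved, and inputs and outputs are logged so you can build evals for this step later.

import json, time

def call_llm(prompt: str) -> str:
    """Hypothetical helper wrapping your model API of choice."""
    raise NotImplementedError

SUMMARIZE_PROMPT = """Condense the agent history below.
Preserve: open goals, decisions already made, tool results still needed.
Drop: resolved dead ends, verbatim tool output already acted upon.

History:
{history}"""

def summarize_history(history: str, log_path: str = "summarization_log.jsonl") -> str:
    summary = call_llm(SUMMARIZE_PROMPT.format(history=history))
    # Log inputs/outputs so subject-matter experts can turn them into evals.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"ts": time.time(), "input": history, "summary": summary}) + "\n")
    return summary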

Context Offloading

Context offloading is the act of storing information outside the LLM’s context, usually via a tool that stores and manages the data.

This might be my favorite tactic, if only because it’s so simple you don’t believe it will work.

Again, Anthropic has a good write-up of the technique, which details their “think” tool, which is basically a scratchpad:

With the “think” tool, we’re giving Claude the ability to include an additional thinking step—complete with its own designated space—as part of getting to its final answer… This is particularly helpful when performing long chains of tool calls or in long multi-step conversations with the user.

I really appreciate the research and other writing Anthropic publishes, but I’m not a fan of this tool’s name. If this tool were called scratchpad, you’d know its function immediately. It’s a place for the model to write down notes that don’t cloud its context and are available for later reference. The name “think” clashes with “extended thinking” and needlessly anthropomorphizes the model… but I digress.

Having a space to log notes and progress works. Anthropic shows pairing the “think” tool with a domain-specific prompt (which you’d do anyway in an agent) yields significant gains: up to a 54% improvement against a benchmark for specialized agents.

Anthropic identified three scenarios where the context offloading pattern is useful:

  1. Tool output analysis. When Claude needs to carefully process the output of previous tool calls before acting and might need to backtrack in its approach;
  2. Policy-heavy environments. When Claude needs to follow detailed guidelines and verify compliance; and
  3. Sequential decision making. When each action builds on previous ones and mistakes are costly (often found in multi-step domains).
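For illustration, here is what a scratchpad-style tool might look like if you defined one yourself (this is not Anthropic’s actual tool definition): a function-calling spec plus a trivial store that lives outside the model’s context.

# Hypothetical scratchpad tool: notes live outside the context and are
# re-injected only when the agent asks for them.
SCRATCHPAD_TOOL = {
    "name": "scratchpad",
    "description": "Write a note to or read notes from a scratchpad outside the conversation.",
    "parameters": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["write", "read"]},
            "note": {"type": "string", "description": "Note text; required for write."},
        },
        "required": ["action"],
    },
}

_notes: list[str] = []

def scratchpad(action: str, note: str = "") -> str:
    if action == "write":
        _notes.append(note)
        return "noted"
    return "\n".join(_notes) if _notes else "(scratchpad is empty)"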

Takeaways

Context management is usually the hardest part of building an agent. Programming the LLM to, as Karpathy says, “pack the context windows just right,” smartly deploying tools, information, and regular context maintenance, is the job of the agent designer.

The key insight across all the above tactics is that context is not free. Every token in the context influences the model’s behavior, for better or worse. The massive context windows of modern LLMs are a powerful capability, but they’re not an excuse to be sloppy with information management.

As you build your next agent or optimize an existing one, ask yourself: Is everything in this context earning its keep? If not, you now have six ways to fix it.


Footnotes

  1. Gemini 2.5 and GPT-4.1 have 1 million token context windows, large enough to throw Infinite Jest in there with plenty of room to spare.
  2. The “Long form text” section in the Gemini docs sums up this optimism nicely.
  3. In fact, in the Databricks study cited above, a frequent way models would fail when given long contexts is they’d return summarizations of the provided context while ignoring any instructions contained within the prompt.
  4. If you’re on the leaderboard, pay attention to the “Live (AST)” columns. These metrics use real-world tool definitions contributed to the product by enterprises, “avoiding the drawbacks of dataset contamination and biased benchmarks.”
  5. Hell, this entire list of tactics is a strong argument for why you should program your contexts.
