

MLOps is dead. Well, not really, but for many the job is evolving into LLMOps. In this episode, Abide AI founder and LLMOps author Abi Aryan joins Ben to discuss what LLMOps is and why it’s needed, particularly for agentic AI systems. Listen in to hear why LLMOps requires a new way of thinking about observability, why we should spend more time understanding human workflows before mimicking them with agents, how to do FinOps in the age of generative AI, and more.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O’Reilly learning platform.
This transcript was created with the help of AI and has been lightly edited for clarity.
00.00: All right, so today we have Abi Aryan. She is the author of the O’Reilly book on LLMOps as well as the founder of Abide AI. So, Abi, welcome to the podcast.
00.19: Thank you so much, Ben.
00.21: All right. Let’s start with the book, which I confess, I just cracked open: LLMOps. People probably listening to this have heard of MLOps. So at a high level, the models have changed: They’re bigger, they’re generative, and so on and so forth. So since you’ve written this book, have you seen a wider acceptance of the need for LLMOps?
00.51: I think more recently there are more infrastructure companies. So there was a conference happening recently, and there was this sort of perception or messaging across the conference, which was “MLOps is dead.” Although I don’t agree with that.
There’s a big difference that companies have started to pick up on more recently, as the infrastructure around the space has sort of started to improve. They’re starting to realize how different the pipelines were that people managed and grew, especially for the older companies like Snorkel that were in this space for years and years before large language models came in. The way they were handling data pipelines—and even the observability platforms that we’re seeing today—have changed tremendously.
01.40: What about, Abi, the general. . .? We don’t have to go into specific tools, but we can if you want. But, you know, if you look at the old MLOps person and then fast-forward, this person is now an LLMOps person. So on a day-to-day basis [has] their suite of tools changed?
02.01: Massively. I think for an MLOps person, the focus was very much around “This is my model. How do I containerize my model, and how do I put it in production?” That was the entire problem and, you know, most of the work was around “Can I containerize it? What are the best practices around how I arrange my repository? Are we using templates?”
Drawbacks happened, but not as often, because most of the time the stuff was tested and there was not too much nondeterministic behavior within the models themselves. Now that has changed.
02.38: [For] most of the LLMOps engineers, the biggest job right now is doing FinOps really, which is controlling the cost, because the models are massive. The second thing, which has been a big difference, is we have shifted from “How can we build systems?” to “How can we build systems that can perform, and not just perform technically but perform behaviorally as well?”: “What is the cost of the model? But also, what is the latency? What’s the throughput looking like? How are we managing the memory across different tasks?”
The problem has really shifted when we talk about it. . . A lot of the focus for MLOps was “Let’s create fantastic dashboards that can do everything.” Right now, no matter which dashboard you create, the monitoring is really very dynamic.
03.32: Yeah, yeah. As you were talking there, you know, I started thinking, yeah, of course, obviously now the inference is essentially a distributed computing problem, right? So that was not the case before. Now you have different phases even of the computation during inference, so you have the prefill phase and the decode phase. And then you might need different setups for those.
So anecdotally, Abi, did the people who were MLOps people successfully migrate themselves? Were they able to upskill themselves to become LLMOps engineers?
04.14: I know a couple of friends who were MLOps engineers. They were teaching MLOps as well—Databricks folks, MVPs. And they were now transitioning to LLMOps.
But the way they started is they started focusing very much on “Can you do evals for these models?” They weren’t really dealing with the infrastructure side of it yet. And that was their slow transition. And right now they’re very much at that point where they’re thinking, “OK, can we make it easy to just catch these problems within the model inferencing itself?”
04.49: A lot of other problems still stay unsolved. Then the other side, which was like a lot of software engineers who entered the field and became AI engineers, they have a much easier transition because software. . . The way I look at large language models is not just as another machine learning model but literally like software 3.0 in that way, which is it’s an end-to-end system that will run independently.
Now, the model isn’t just something you plug in. The model is the product, really. So for those people, most software is built around these ideas, which is, you know, we need strong cohesion. We need low coupling. We need to think about “How are we doing microservices, how the communication happens between different tools that we’re using, how are we calling up our endpoints, how are we securing our endpoints?”
Those questions come easier. So the system design side of things comes easier to people who work in traditional software engineering. So the transition has been a little bit easier for them as compared to people who were traditionally like MLOps engineers.
05.59: And hopefully your book will help some of these MLOps people upskill themselves into this new world.
Let’s pivot quickly to agents. Obviously it’s a buzzword. Just like anything in the space, it means different things to different teams. So how do you distinguish agentic systems yourself?
06.24: There are two words in the space. One is agents; one is agent workflows. Basically agents are the components really. Or you can call them the model itself, but they’re trying to figure out what you meant, even if you forgot to tell them. That’s the core work of an agent. And the work of a workflow or the workflow of an agentic system, if you want to call it, is to tell these agents what to actually do. So one is responsible for execution; the other is responsible for the planning side of things.
07.02: I think sometimes when tech journalists write about these things, the general public gets the notion that there’s this monolithic model that does everything. But the reality is, most teams are moving away from that design, as you describe.
So they have an agent that acts as an orchestrator or planner and then parcels out the different steps or tasks needed, and then maybe reassembles in the end, right?
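A minimal sketch of the orchestrator pattern described here: a planner decomposes a request into tasks, worker agents execute them, and the results are reassembled at the end. All names are hypothetical; a real system would replace the stand-in functions with LLM calls.

```python
# Minimal sketch of the planner/worker orchestration pattern (all names hypothetical).
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    payload: str


def plan(request: str) -> List[Task]:
    # Stand-in for an LLM planner that decomposes the request into steps.
    return [Task("research", request), Task("draft", request), Task("review", request)]


def run_workflow(request: str, workers: Dict[str, Callable[[Task], str]]) -> str:
    # Parcel out each planned task to a worker agent, then reassemble the results.
    results = [workers[task.name](task) for task in plan(request)]
    return "\n".join(results)


if __name__ == "__main__":
    workers = {
        "research": lambda t: f"[research notes for: {t.payload}]",
        "draft": lambda t: f"[draft based on: {t.payload}]",
        "review": lambda t: f"[review comments on: {t.payload}]",
    }
    print(run_workflow("summarize Q3 incident reports", workers))
```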
07.42: Coming back to your point, it’s now less of a problem of machine learning. It’s, again, more like a distributed systems problem because we have multiple agents. Some of these agents will have more load—they will be the frontend agents, which are communicating to a lot of people. Obviously, on the GPUs, these need more distribution.
08.02: And when it comes to the other agents that may not be used as much, they can be provisioned based on “This is the need, and this is the availability that we have.” So all of that provisioning again is a problem. The communication is a problem. Setting up tests across different tasks itself within an entire workflow, now that becomes a problem, which is where a lot of people are trying to implement context engineering. But it’s a very complicated problem to solve.
08.31: And then, Abi, there’s also the problem of compounding reliability. Let’s say, for example, you have an agentic workflow where one agent passes off to another agent and then to a third agent. Each agent may have a certain amount of reliability, but the errors compound across the pipeline, which makes it more challenging.
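To make the compounding concrete, here is a quick back-of-the-envelope calculation; the numbers are purely illustrative.

```python
# Illustrative only: if each agent succeeds independently with probability p,
# end-to-end reliability of an n-agent pipeline decays as p ** n.
for p in (0.99, 0.95, 0.90):
    for n in (1, 3, 5):
        print(f"p={p:.2f}, agents={n}: end-to-end success = {p ** n:.3f}")
# For example, three agents at 95% each are only about 85.7% reliable end to end.
```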
09.02: And that’s where there’s a lot of research work going on in the space. It’s an idea that I’ve talked about in the book as well, at the point when I was writing it, especially chapter four, in which a lot of these are described. Most of the companies right now are [using] monolithic architecture, but that’s not going to be sustainable as we move toward production applications.
We have to go towards a microservices architecture. And the moment we go towards microservices architecture, there are a lot of problems. One will be the hardware problem. The other is consensus building, which is. . .
Let’s say you have three different agents spread across three different nodes, which would be running very differently. Let’s say one is running on an H100; one is running on something else. How can we achieve consensus if even one of the nodes ends up winning? So that’s open research work [where] people are trying to figure out, “Can we achieve consensus in agents based on whatever answer the majority is giving, or how do we really think about it?” Should it be set up with a threshold, where if the agreement is beyond that threshold, it works?
One of the frameworks trying to work in this space is called MassGen; they’re working on the research side of solving this problem in the tool itself.
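The majority-vote-with-a-threshold idea sketched here can be shown in a few lines. This is not MassGen’s actual algorithm, just the general shape of the problem.

```python
# Minimal sketch of threshold-based majority consensus across agents
# (the general idea described above, not any specific framework's implementation).
from collections import Counter
from typing import List, Optional


def consensus(answers: List[str], threshold: float = 0.66) -> Optional[str]:
    """Return the majority answer if its share of votes clears the threshold,
    otherwise None to signal that the workflow should retry or escalate."""
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes / len(answers) >= threshold else None


print(consensus(["42", "42", "41"]))   # "42"  (2/3 of agents agree)
print(consensus(["42", "41", "40"]))   # None  -> no consensus, escalate
```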
10.31: By the way, even back in the microservices days in software architecture, obviously people went overboard too. So I think that, as with any of these new things, there’s a bit of trial and error that you have to go through. And the better you can test your systems and have a setup where you can reproduce and try different things, the better off you are, because many times your first stab at designing your system may not be the right one. Right?
11.08: Yeah. And I’ll give you two examples of this. So AI companies tried to use a lot of agentic frameworks. You know people have used Crew; people have used n8n, they’ve used. . .
11.25: Oh, I hate those! Not I hate. . . Sorry. Sorry, my friends at Crew.
11.30: And 90% of the people working in this space seriously have already made that transition, which is “We are going to write it ourselves.”
The same happened for evaluation: There were a lot of evaluation tools out there. What they were doing on the surface is literally just tracing, and tracing wasn’t really solving the problem—it was just a beautiful dashboard that doesn’t really serve much purpose. Maybe for the business teams. But at least for the ML engineers who are supposed to debug these problems and, you know, optimize these systems, essentially, it was not giving much other than “What is the error response that we’re getting to everything?”
12.08: So again, for that one as well, most of the companies have developed their own evaluation frameworks in-house as of now. The people who are just starting out obviously still use them. But most of the companies that started working with large language models in 2023 have tried every tool out there in 2023 and 2024. And right now more and more people are staying away from the frameworks and LangChain and everything.
People have understood that most of the frameworks in this space are not superreliable.
12.41: And [are] also, honestly, a bit bloated. They come with too many things that you don’t need in many ways. . .
12.54: Security loopholes as well. For example, I reported one of the security loopholes with LangChain, with LangSmith, back in 2024. So those things obviously get reported by people [and] get worked on, but the companies aren’t really proactively working on closing those security loopholes.
13.15: Two open source projects that I like that are not specifically agentic are DSPy and BAML. Wanted to give them a shout out. So this point I’m about to make, there’s no easy, clear-cut answer. But one thing I noticed, Abi, is that people will do the following, right? I’m going to take something we do, and I’m going to build agents to do the same thing. But the way we do things is I have a—I’m just making this up—I have a project manager and then I have a designer, I have role B, role C, and then there’s certain emails being exchanged.
So then the first step is “Let’s replicate not just the roles but kind of the exchange and communication.” And sometimes that actually increases the complexity of the design of your system because maybe you don’t need to do it the way the humans do it. Right? Maybe if you go to automation and agents, you don’t have to over-anthropomorphize your workflow. Right. So what do you think about this observation?
14.31: A very interesting analogy I’ll give you is people are trying to replicate intelligence without understanding what intelligence is. The same for consciousness. Everybody wants to replicate and create consciousness without understanding consciousness. So the same is happening with this as well, which is we are trying to replicate a human workflow without really understanding how humans work.
14.55: And sometimes humans may not be the most efficient thing. Like they exchange five emails to arrive at something.
15.04: And humans are never context-defined in a very limiting sense. Even if somebody’s job is to do editing, they’re not just doing editing. They are looking at the flow. They are looking for a lot of things which you can’t really define. Obviously you can over a period of time, but it needs a lot of observation to understand. And that skill also depends on who the person is. Different people have different skills as well. Most of the agentic systems right now, they’re just glorified Zapier IFTTT routines. That’s the way I look at them right now. The if recipes: If this, then that.
15.48: Yeah, yeah. Robotic process automation I guess is what people call it. The other thing that I don’t think people understand just from reading the popular tech press is that agents have levels of autonomy, right? Most teams don’t actually build an agent and unleash it fully autonomous from day one.
I mean, I guess the analogy would be in self-driving cars: They have different levels of automation. Most enterprise AI teams realize that with agents, you have to kind of treat them that way too, depending on the complexity and the importance of the workflow.
So you go first very much a human is involved and then less and less human over time as you develop confidence in the agent.
But I think it’s not good practice to just kind of let an agent run wild. Especially right now.
16.56: It’s not, because who’s the person answering if the agent goes wrong? And that’s a question that has come up often. So this is the work that we’re doing at Abide really, which is trying to create a decision layer on top of the knowledge retrieval layer.
17.07: Most of the agents which are built using just large language models. . . LLMs—I think people need to understand this part—are fantastic at knowledge retrieval, but they do not know how to make decisions. If you think agents are independent decision makers and they can figure things out, no, they cannot figure things out. They can look at the database and try to do something.
Now, what they do may or may not be what you like, no matter how many rules you define across that. So what we really need to develop is some sort of symbolic language around how these agents are working, which is more like trying to give them a model of the world: “What is the cause and effect of all of these decisions that you’re making? How do we prioritize one decision where the. . .? What was the reasoning behind that?” That entire decision-making reasoning has been the missing part.
18.02: You brought up the topic of observability. There’s two schools of thought here as far as agentic observability. The first one is we don’t need new tools. We have the tools. We just have to apply [them] to agents. And then the second, of course, is this is a new situation. So now we need to be able to do more. . . The observability tools have to be more capable because we’re dealing with nondeterministic systems.
And so maybe we need to capture more information along the way. Chains of decision, reasoning, traceability, and so on and so forth. Where do you fall in this kind of spectrum of we don’t need new tools or we need new tools?
18.48: We don’t need new tools, but we certainly need new frameworks, and especially a new way of thinking. Observability in the MLOps world—fantastic; it was just about tools. Now, people have to stop thinking about observability as just visibility into the system and start thinking of it as an anomaly detection problem. And that was something I’d written in the book as well. Now it’s no longer about “Can I see what my token length is?” No, that’s not enough. You have to look for anomalies at every single part of the layer across a lot of metrics.
19.24: So your position is we can use the existing tools. We may have to log more things.
19.33: We may have to log more things, and then start building simple ML models to be able to do anomaly detection.
Think of managing any machine learning model, any LLM, any agent as really like a fraud detection pipeline. So every single time, you’re looking for “What are the simplest signs of fraud?” And that can happen across various factors. But we need more logging. And again, you don’t need external tools for that. You can set up your own loggers as well.
Most of the people I know have been setting up their own loggers within their companies. So you can simply use telemetry to a.) use the general logs, and b.) define your own custom logs as well, depending on your agent pipeline itself. You can define “This is what it’s trying to do,” log more things around that, and then start building small machine learning models to look for what’s going on over there.
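As a rough illustration of that pattern (log per-call metrics yourself, then fit a small model to flag anomalies), here is a minimal sketch. The metric names are made up, and IsolationForest is one arbitrary choice of detector, not something named in the conversation.

```python
# Minimal sketch: log per-call LLM metrics, then flag outliers the way a
# fraud-detection pipeline would. IsolationForest is an illustrative choice.
import logging

import numpy as np
from sklearn.ensemble import IsolationForest

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_calls")


def log_call(prompt_tokens: int, completion_tokens: int, latency_s: float, cost_usd: float):
    # Custom telemetry: one structured log line per model call.
    log.info("tokens_in=%d tokens_out=%d latency=%.2fs cost=%.4f",
             prompt_tokens, completion_tokens, latency_s, cost_usd)
    return [prompt_tokens, completion_tokens, latency_s, cost_usd]


# Collect a window of calls (synthetic numbers here), including one odd call.
window = np.array([log_call(800, 200, 1.2, 0.004) for _ in range(200)]
                  + [log_call(800, 4000, 9.5, 0.08)])

# Fit a small anomaly detector over the logged metrics; -1 marks an anomaly.
detector = IsolationForest(contamination=0.01, random_state=0).fit(window)
flags = detector.predict(window)
print("anomalous calls:", np.where(flags == -1)[0])
```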
20.36: So what is the state of things: Where are we? How many teams are doing this?
20.42: Very few. Very, very few. Maybe just the top labs, the ones who are doing reinforcement learning training and using RL environments, because that’s where they’re getting their data to do RL. But people who are not using RL to retrain their models aren’t really doing much of this part; they’re still depending very much on external tools.
21.12: I’ll get back to RL in a second. But one topic you raised when you pointed out the transition from MLOps to LLMOps was the importance of FinOps, which is, for our listeners, basically managing your cloud computing costs—or in this case, increasingly mastering token economics. Because basically, it’s one of these things that I think can bite you.
For example, the first time you use Claude Code, you go, “Oh, man, this tool is powerful.” And then boom, you get an email with a bill. I see, that’s why it’s powerful. And you multiply that across the board to teams who are starting to maybe deploy some of these things. And you see the importance of FinOps.
So where are we, Abi, as far as tooling for FinOps in the age of generative AI and also the practice of FinOps in the age of generative AI?
22.19: Less than 5%, maybe even 2% of the way there.
22.24: Really? But obviously everyone’s aware of it, right? Because at some point, when you deploy, you become aware.
22.33: Not enough people. A lot of people just think about FinOps as cloud, basically the cloud cost. And there are different kinds of costs in the cloud. One of the things people are not doing enough of is profiling their models properly, which is [determining] “Where are the costs really coming from? Is it our models’ compute? Are they taking too much RAM?”
22.58: Or are we using reasoning when we don’t need it?
23.00: Exactly. Now that’s a problem we solve very differently. That’s where, yes, you can do kernel fusion. Define your own custom kernels. Right now there’s a massive number of people who think we need to rewrite kernels for everything. It’s only going to solve one problem, which is the compute-bound problem. But it’s not going to solve the memory-bound problem. Your data engineering pipelines are what’s going to solve your memory-bound problems.
And that’s where most of the focus is missing. I’ve mentioned it in the book as well: Data engineering is the foundation. First solve those problems, and then move to the compute-bound problems. Do not start by optimizing the kernels over there. And then the third part would be the communication-bound problem, which is “How do we make these GPUs talk smarter with each other? How do we figure out the agent consensus and all of those problems?”
Now that’s a communication problem. And that’s what happens when there are different levels of bandwidth. Everybody’s dealing with the internet bandwidth as well, the kind of serving speed as well, different kinds of cost and every kind of transitioning from one node to another. If we’re not really hosting our own infrastructure, then that’s a different problem, because it depends on “Which server do you get assigned your GPUs on again?”
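For readers who want a starting point, here is a rough first-pass profiling sketch, assuming PyTorch on a CUDA GPU. Comparing wall-clock latency against peak memory gives an initial hint of whether a workload is compute-bound or memory-bound; serious profiling would use torch.profiler or vendor tools.

```python
# Rough first-pass profiling sketch (assumes PyTorch + a CUDA GPU).
import time

import torch


def profile_forward(model, batch):
    # Reset memory stats, run one forward pass, and report latency and peak memory.
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        out = model(batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"latency={elapsed:.3f}s  peak_memory={peak_gb:.2f} GB")
    return out


if __name__ == "__main__" and torch.cuda.is_available():
    # Toy example: a single large linear layer standing in for a real model.
    model = torch.nn.Linear(4096, 4096).cuda()
    batch = torch.randn(32, 4096, device="cuda")
    profile_forward(model, batch)
```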
24.20: Yeah, yeah, yeah. I want to give a shout out to Ray—I’m an advisor to Anyscale—because Ray basically is built for these sorts of pipelines because it can do fine-grained utilization and help you decide between CPU and GPU. And just generally, you don’t think that the teams are taking token economics seriously?
I guess not. How many people have I heard talking about caching, for example? Because if it’s a prompt that [has been] answered before, why do you have to go through it again?
25.07: I think plenty of people have started implementing KV caching, but they don’t really know. . . Again, one of the questions people don’t understand is “How much do we need to store in the memory itself, and how much do we need to store in the cache?” which is the big memory question. So that’s the one I don’t think people are able to solve. A lot of people are storing too much stuff in the cache that should actually be stored in the RAM itself, in the memory.
And there are generalist applications that don’t really understand that this agent doesn’t really need access to the memory. There’s no point. It’s just lost in the throughput really. So I think the problem isn’t really caching. The problem is that differentiation of understanding for people.
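For reference, the simplest version of the prompt-level caching Ben alludes to is an exact-match cache keyed on a hash of the prompt; the KV caching Abi mentions lives inside the serving stack and is a separate, lower-level mechanism. A minimal sketch, with a stand-in for the model call:

```python
# Minimal exact-match response cache for repeated prompts (illustrative only).
import hashlib
from typing import Callable, Dict


class ResponseCache:
    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate          # the expensive LLM call
        self._store: Dict[str, str] = {}

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Only pay for prompts we haven't answered before.
            self._store[key] = self._generate(prompt)
        return self._store[key]


cache = ResponseCache(generate=lambda p: f"[model answer to: {p}]")
cache.complete("What is our refund policy?")   # calls the model
cache.complete("What is our refund policy?")   # served from cache
```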
25.55: Yeah, yeah, I just threw that out as one element. Because obviously there’s many, many things to mastering token economics. So you, you brought up reinforcement learning. A few years ago, obviously people got really into “Let’s do fine-tuning.” But then they quickly realized. . . And actually fine-tuning became easy because basically there became so many services where you can just focus on labeled data. You upload your labeled data, boom, come back from lunch, you have a fine-tuned model.
But then people realize that “I fine-tuned, but the model that results isn’t really as good as my fine-tuning data.” And then obviously RAG and context engineering came into the picture. Now it seems like more people are again talking about reinforcement learning, but in the context of LLMs. And there’s a lot of libraries, many of them built on Ray, for example. But it seems like what’s missing, Abi, is that fine-tuning got to the point where I can sit down a domain expert and say, “Produce labeled data.” And basically the domain expert is a first-class participant in fine-tuning.
As best I can tell, for reinforcement learning, the tools aren’t there yet. The UX hasn’t been figured out in order to bring in the domain experts as the first-class citizen in the reinforcement learning process—which they need to be because a lot of the stuff really resides in their brain.
27.45: The big problem here, very much to the point of what you pointed out, is the tools aren’t really there. And one very specific thing I can tell you is most of the reinforcement learning environments that you’re seeing are static environments. Agents are not learning statically; they are learning dynamically. But most RL environments cannot adapt dynamically. They basically look like what emerged in 2018, 2019, with OpenAI Gym, when a lot of reinforcement learning libraries were coming out.
28.18: There is a line of work called curriculum learning, which is basically adapting the difficulty of the training tasks to the model’s results. That can be used in reinforcement learning, but I’ve not seen any practical implementation of curriculum learning for reinforcement learning environments. So people create these environments—fantastic. They work well for a little bit of time, and then they become useless.
So that’s where even OpenAI, Anthropic, those companies are struggling as well. They’ve paid heavily for yearlong contracts to say, “Can you build this vertical environment? Can you build that vertical environment?” and that works fantastically. But once the model learns on it, there’s nothing else to learn. And then you go back into the question of, “Is this data fresh? Is this adaptive with the world?” And it becomes the same RAG problem over again.
29.18: So maybe the problem is with RL itself. Maybe we need a different paradigm. It’s just too hard.
Let me close by looking to the future. The first thing is—the space is moving so fast, this might be an impossible question to ask, but if you look at, let’s say, 6 to 18 months, what are some things in the research domain that are not being talked about enough but might produce enough practical utility that we will start hearing about them in 6 to 18 months?
29.55: One is how to profile your machine learning models, like the entire systems end-to-end. A lot of people do not understand them as systems, but only as models. So that’s one thing which will make a massive amount of difference. There are a lot of AI engineers today, but we don’t have enough system design engineers.
30.16: This is something that Ion Stoica at Sky Computing Lab has been giving keynotes about. Yeah. Interesting.
30.23: The second part is. . . I’m optimistic about seeing curriculum learning applied to reinforcement learning as well, where our RL environments can adapt in real time, so when we train agents on them, they are dynamically adapting as well. That’s also [some] of the work being done by labs like Sakana AI, which are working on artificial life, evolution of machine learning models, all of that stuff.
30.57: The third thing where I feel like the communities are falling behind massively is on the data engineering side. That’s where we have massive gains to get.
31.09: So on the data engineering side, I’m happy to say that I advise several companies in the space that are completely focused on tools for these new workloads and these new data types.
Last question for our listeners: What mindset shift or what skill do they need to pick up in order to position themselves in their career for the next 18 to 24 months?
31.40: For anybody who’s an AI engineer, a machine learning engineer, an LLMOps engineer, or an MLOps engineer, first learn how to profile your models. Start picking up Ray very quickly as a tool to just get started on, to see how distributed systems work. You can pick vLLM if you want, but start understanding distributed systems first. And once you start understanding those systems, then start looking back into the models themselves.
32.11: And with that, thank you, Abi.

In this episode, Ben Lorica and Chris Butler, director of product operations for GitHub’s Synapse team, chat about the experimentation Chris is doing to incorporate generative AI into the product development process—particularly with the goal of reducing toil for cross-functional teams. It isn’t just automating busywork (although there’s some of that). He and his team have created agents that expose the right information at the right time, use feedback in meetings to develop “straw man” prototypes for the team to react to, and even offer critiques from specific perspectives (a CPO agent?). Very interesting stuff.
This transcript was created with the help of AI and has been lightly edited for clarity.
00.00: Today we have Chris Butler of GitHub, where he leads a team called Synapse. Welcome to the podcast, Chris.
00.15: Thank you. Yeah. Synapse is actually part of our product team and what we call EPD operations, which is engineering, product, and design. And our team is mostly engineers. I’m the product lead for it, but we help solve and reduce toil for these cross-functional teams inside of GitHub, mostly building internal tooling, with the focus on process automation and AI. But we also have a speculative part of our practice as well: trying to imagine the future of cross-functional teams working together and how they might do that with agents, for example.
00.45: Actually, you are the first person I’ve come across who’s used the word “toil.” Usually “tedium” is what people use, in terms of describing the parts of their job that they would rather automate. So you’re actually a big proponent of talking about agents that go beyond coding agents.
01.03: Yeah. That’s right.
01.05: And specifically in your context for product people.
01.09: And actually, for just the way that, say, product people work with their cross-functional teams. But I would also include other types of functions: legal, privacy, customer support, docs, any of these people that are working to actually help build a product. I think there needs to be a transformation of the way we think about these tools.
01.29: GitHub is a very engineering-led organization as well as a very engineering-focused organization. But my role is to really think about “How do we do a better job between all these people that I would call nontechnical—they are sometimes technical, of course, but they’re not necessarily there to write code. . . How do we actually work together to build great products?” And so that’s how I think about the work.
01.48: For people who aren’t familiar with product management and product teams, what’s toil in the context of product teams?
02.00: So toil is actually something that I stole from a Google SRE from the standpoint of any type of thing that someone has to do that is manual, tactical, repetitive. . . It usually doesn’t really add to the value of the product in any way. It’s something that as the team gets bigger or the product goes down the SDLC or lifecycle, it scales linearly, with the fact that you’re building bigger and bigger things. And so it’s usually something that we want to try to cut out, because not only is it potentially a waste of time, but there’s also a perception within the team it can cause burnout.
02.35: If I have to constantly be doing toilsome parts of my work, I feel I’m doing things that don’t really matter rather than focusing on the things that really matter. And what I would argue is especially for product managers and cross-functional teams, a lot of the time that is processes that they have to use, usually to share information within larger organizations.
02.54: A good example of that is status reporting. Status reporting is one of those things where people will spend anywhere from 30 minutes to hours per week. And sometimes it’s in certain parts of the team—technical product managers, product managers, engineering managers, program managers are all dealing with this aspect that they have to in some way summarize the work that the team is doing and then shar[e] that not only with their leadership. . . They want to build trust with their leadership, that they’re making the right decisions, that they’re making the right calls. They’re able to escalate when they need help. But also then to convey information to other teams that are dependent on them or they’re dependent on. Again, this is [in] very large organizations, [where] there’s a huge cost to communication flows.
03.35: And so that’s why I use status reporting as a good example of that. Now with the use of the things like LLMs, especially if we think about our LLMs as a compression engine or a translation engine, we can then start to use these tools inside of these processes around status reporting to make it less toilsome. But there’s still aspects of it that we want to keep that are really about humans understanding, making decisions, things like that.
03.59: And this is key. So one of the concerns that people have is about a hollowing out in the following context: If you eliminate toil in general, the problem there is that your most junior or entry-level employees actually learn about the culture of the organization by doing toil. There’s some level of toil that becomes part of the onboarding and acculturation of young employees. But on the other hand, this is a challenge for organizations to just change how they onboard new employees and what kinds of tasks they give them and how they learn more about the culture of the organization.
04.51: I would differentiate between the idea of toil and paying your dues within the organization. In investment banking, there’s a whole concern about that: “They just need to sit in the office for 12 hours a day to really get the culture here.” And I would differentiate that from. . .
05.04: Or “Get this slide to pitch decks and make sure all the fonts are the right fonts.”
05.11: That’s right. Yeah, I worked at Facebook Reality Labs, and there were many times where we would do a Zuck review, and getting those slides perfect was a huge task for the team. What I would say is I want to differentiate this from the gaining of expertise. So if we think about Gary Klein, naturalistic decision making, real expertise is actually about being able to see an environment. And that could be a data environment [or] information environment as well. And then as you gain expertise, you’re able to discern between important signals and noise. And so what I’m not advocating for is to remove the ability to gain that expertise. But I am saying that toilsome work doesn’t necessarily contribute to expertise.
05.49: In the case of status reporting as an example—status reporting is very valuable for a person to be able to understand what is going on with the team, and then, “What actions do I need to take?” And we don’t want to remove that. But the idea that a TPM or product manager or EM has to dig through all of the different issues that are inside of a particular repo to look for specific updates and then do their own synthesis of a draft, I think there is a difference there. And so what I would say is that the idea of me reading this information in a way that is very convenient for me to consume and then to be able to shape the signal that I then put out into the organization as a status report, that is still very much a human decision.
06.30: And I think that’s where we can start to use tools. Ethan Mollick has talked about this a lot in the way that he’s trying to approach including LLMs in, say, the classroom. There’s two patterns that I think could come out of this. One is that when I have some type of early draft of something, I should be able to get a lot of early feedback that is very low reputational risk. And what I mean by that is that a bot can tell me “Hey, this is not written in a way with the active voice” or “[This] is not really talking about the impact of this on the organization.” And so I can get that super early feedback in a way that is not going to hurt me.
If I publish a really bad status report, people may think less of me inside the organization. But using a bot or an agent or just a prompt to even just say, “Hey, these are the ways you can improve this”—that type of early feedback is really, really valuable. That I have a draft and I get critique from a bunch of different viewpoints I think is super valuable and will build expertise.
07.24: And then there’s the other side, which is, when we talk about consuming lots of information and then synthesizing or translating it into a draft, I can then critique “Is this actually valuable to the way that I think that this leader thinks? Or what I’m trying to convey as an impact?” And so then I am critiquing the straw man that is output by these prompts and agents.
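A minimal sketch of the first pattern, the low-stakes early critique of a draft, assuming the OpenAI Python SDK; the model name and prompt wording are placeholders, not what GitHub actually uses.

```python
# Minimal sketch of an early-feedback critique loop for a draft status report
# (assumes the OpenAI Python SDK; model and prompt are placeholders).
from openai import OpenAI

client = OpenAI()

CRITIC_PROMPT = (
    "You review draft status reports. Point out passive voice, missing impact "
    "statements, and anything a leader could not act on. Be specific and brief."
)


def critique_status_report(draft: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": CRITIC_PROMPT},
            {"role": "user", "content": draft},
        ],
    )
    return response.choices[0].message.content
```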
07.46: Those two different patterns together actually create a really great loop for me to be able to learn not only from agents but also from the standpoint of seeing how. . . The part that ends up being really exciting is when once you start to connect the way communication happens inside the organization, I can then see what my leaders passed on to the next leader or what this person interpreted this as. And I can use that as a feedback loop to then improve, over time, my expertise in, say, writing a status report that is shaped for the leader. There’s also a whole thing that when we talk about status reporting in particular, there is a difference in expertise that people are getting that I’m not always 100%. . .
08.21: It’s valuable for me to understand how my leader thinks and makes decisions. I think that is very valuable. But the idea that I will spend hours and hours shaping and formulating a status report from my point of view for someone else can be aided by these types of systems. And so status should not be at the speaker’s mouth; it should be at the listener’s ear.
For these leaders, they want to be able to understand “Are the teams making the right decisions? Do I trust them? And then where should I preemptively intervene because of my experience or maybe my understanding of the context in the broader organization?” And so that’s what I would say: These tools are very valuable in helping build that expertise.
09.00: It’s just that we have to rethink “What is expertise?” And I just don’t buy it that paying your dues is the way you gain expertise. You do sometimes. Absolutely. But a lot of it is also just busy work and toil.
09.11: My thing is these are productivity tools. And so you make even your junior employees productive—you just change the way you use your more-junior employees.
09.24: Maybe just one thing to add to this is that there is something really interesting inside of the education world of using LLMs: trying to understand where someone is at. The type of feedback you give someone who is very early in their career, or doing something for the first time, is potentially very different from the feedback you give someone who is much further along in expertise: they just want to get down to “What are some things I’m missing here? Where am I biased?” Those are things where I think we also need to do a better job for those early employees, the people that are just starting to get expertise—“How do we train them using these tools as well as other ways?”
10.01: And I’ve done that as well. I do a lot of learning and development help, internal to companies, and I did that as part of the PM faculty for learning and development at Google. And so, thinking a lot about how PMs gain expertise, I think we’re doing a real disservice by making product manager such a hard junior position to get.
10.18: I think it’s really bad because, right out of college, I started doing program management, and it taught me so much about this. But at Microsoft, when I joined, we would say that the program manager wasn’t really worth very much for the first two years, right? Because they’re gaining expertise in this.
And so I think LLMs can help people gain expertise faster and also help them avoid making errors that other people might make. But I think there’s a lot to do with just learning and development in general that we need to pair with LLMs and human systems.
10.52: In terms of agents, I guess agents for product management, first of all, do they exist? And if they do, I always like to look at what level of autonomy they really have. Most agents really are still partially autonomous, right? There’s still a human in the loop. And so the question is “How much is the human in the loop?” It’s kind of like a self-driving car. There’s driver assists, and then there’s all the way to self-driving. A lot of the agents right now are “driver assist.”
11.28: I think you’re right. That’s why I don’t always use the term “agent,” because it’s not an autonomous system that is storing memory, using tools, and constantly operating.
I would argue though that there is no such thing as “human out of the loop.” We’re probably just drawing the system diagram wrong if we’re saying that there’s no human that’s involved in some way. That’s the first thing.
11.53: The second thing I’d say is that I think you’re right. A lot of the time right now, it ends up being when the human needs the help. We end up creating systems inside of GitHub: We have something called Copilot Spaces, which is really like a custom GPT. It’s really just a bundling of context that I can then go to when I need help with a particular type of thing. We’ve built very highly specific Copilot Spaces, like “I need to write a blog announcement about something. So what’s the GitHub writing style? How should I be wording this, avoiding jargon?” Internal things like that. So it can be highly specific.
We also have more general tools that are kind of like “How do I form and maintain initiatives throughout the entire software development lifecycle? When do I need certain types of feedback? When do I need to generate the 12 to 14 different documents that compliance and downstream teams need?” And so those tend to be operating in the background to autodraft these things based on the context that’s available. And so that’s I’d say that’s semiagentic, to a certain extent.
12.52: But I think actually there’s really big opportunities when it comes to. . . One of the cases that we’re working on right now is actually linking information in the GitHub graph that is not commonly linked. And so a key example of that might be kicking off all of the process that goes along with doing a release.
When I first get started, I actually want to know in our customer feedback repo, in all the different places where we store customer feedback, “Where are there times that customers actually asked about this or complained about it or had some information about this?” And so when I get started, being able to automatically link something like a release tracking issue with all of this customer feedback becomes really valuable. But it’s very hard for me as an individual to do that. And what we really want—and what we’re building—[are] things that are more and more autonomous about constantly searching for feedback or information that we can then connect to this release tracking issue.
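A rough sketch of that kind of linking, using GitHub’s public REST search and comments endpoints via the requests library; the repo names, query, token, and issue number are hypothetical, and this is not the internal tooling described here.

```python
# Rough sketch: find customer-feedback issues that mention a feature and link
# them on a release tracking issue. Repo names and issue number are hypothetical.
import requests

TOKEN = "ghp_..."  # a real token would come from a secret store
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"}


def find_related_feedback(keyword: str, feedback_repo: str = "acme/customer-feedback"):
    # Search issues in the feedback repo that mention the feature being released.
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"{keyword} repo:{feedback_repo} is:issue"},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return [item["html_url"] for item in resp.json()["items"]]


def link_to_release_issue(urls, release_repo="acme/platform", issue_number=1234):
    # Post the collected links as a comment on the release tracking issue.
    body = "Related customer feedback:\n" + "\n".join(f"- {u}" for u in urls)
    resp = requests.post(
        f"https://api.github.com/repos/{release_repo}/issues/{issue_number}/comments",
        json={"body": body},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
```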
13.44: So that’s why I say we’re starting to get into the autonomous realm when it comes to this idea of something going around looking for linkages that don’t exist today. And so that’s one of those things, because again, we’re talking about information flow. And a lot of the time, especially in organizations the size of GitHub, there’s lots of siloing that takes place.
We have lots of repos. We have lots of information. And so it’s really hard for a single person to ever keep all of that in their head and to know where to go, and so [we’re] bringing all of that into the tools that they end up using.
14.14: So for example, we’ve also created internal things—these are more assist-type use cases—but the idea of a Gemini Gem inside of a Google doc or an M365 agent inside of Word that is then also connected to the GitHub graph in some way. I think it’s “When do we expose this information? Is it always happening in the background, or is it only when I’m drafting the next version of this initiative that ends up becoming really, really important?”
14.41: Some of the work we’ve been experimenting with is actually “How do we start to include agents inside of the synchronous meetings that we actually do?” You probably don’t want an agent to suddenly start speaking, especially because there’s lots of different agents that you may want to have in a meeting.
We don’t have a designer on our team, so I actually end up using an agent that is prompted to be like a designer and think like a designer inside of these meetings. And so we probably don’t want them to speak up dynamically inside the meeting, but we do want them to add information if it’s helpful.
We want to autoprototype things as a straw man for us to be able to react to. We want to start to use our planning agents and stuff like that to help us plan out “What is the work that might need to take place?” It’s a lot of experimentation about “How do we actually pull things into the places that humans are doing the work?”—which is usually synchronous meetings, some types of asynchronous communication like Teams or Slack, things like that.
15.32: So that’s where I’d say the full possibility [is] for, say, a PM. And our customers are also TPMs and leaders and people like that. It really has to do with “How are we linking synchronous and asynchronous conversations with all of this information that is out there in the ecosystem of our organization that we don’t know about yet, or viewpoints that we don’t have that we need to have in this conversation?”
15.55: You mentioned the notion of a design agent passively in the background, attending a meeting. This is fascinating. So this design agent, what is it? Is it a fine-tuned agent or. . .? What exactly makes it a design agent?
16.13: In this particular case, it’s a specific prompt that defines what a designer would usually do in a cross-functional team and what they might ask questions about, what they would want clarification of. . .
16.26: Completely reliant on the pretrained foundation model—no posttraining, no RAG, nothing?
16.32: No, no. [Everything is in the prompt] at this point.
16.36: How big is this prompt?
16.37: It’s not that big. I’d say it’s maybe at most 50 lines, something like that. It’s pretty small. The truth is, the idea of a designer is something that LLMs know about. But more for our specific case, right now it’s really just based on this live conversation. And there’s a lot of papercuts in the way that we have to do a site call, pull a live transcript, put it into a space, and [then] I have a bunch of different agents that are inside the space that will then pipe up when they have something interesting to say, essentially.
And it’s a little weird because I have to share my screen and people have to read it, hold the meeting. So it’s clunky right now in the way that we bring this in. But what it will bring up is “Hey, these are patterns inside of design that you may want to think about.” Or you know, “For this particular part of the experience, it’s still pretty ambiguous. Do you want to define more about what this part of the process is?” And we’ve also included legal, privacy, data-oriented groups. Even the idea of a facilitator agent saying that we were getting off track or we have these other things to discuss, that type of stuff. So again, these are really rudimentary right now.
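As an invented example (not GitHub’s actual prompt), a small “designer agent” prompt along the lines described here might look something like this:

```python
# Invented illustration of a short "designer agent" system prompt.
DESIGNER_AGENT_PROMPT = """\
You are acting as the product designer in a cross-functional product meeting.
You receive a running transcript of the conversation.

Speak up only when you have something a designer would raise, for example:
- an existing design-system pattern that fits what is being discussed
- a part of the user experience that is still ambiguous or undefined
- an accessibility or usability concern the group has not mentioned

When you do speak up, be brief, reference the specific moment in the transcript
you are reacting to, and phrase your input as a question or suggestion rather
than a decision. If you have nothing useful to add, say nothing.
"""
```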
17.37: Now, what I could imagine though is, we have a design system inside of GitHub. How might we start to use that design system and use internal prototyping tools to autogenerate possibilities for what we’re talking about? And I guess when I think about using prototyping as a PM, I don’t think the PMs should be vibe coding everything.
I don’t think the prototype replaces a lot of the cross-functional documents that we have today. But if we have been talking about a feature for about 30 minutes, that is a lot of interesting context. If we can say, “Autogenerate three different prototypes coming from slightly different directions, slightly different places that we might integrate inside of our current product,” it gives us, again, that straw man to critique, which then uncovers additional assumptions, additional values, additional principles that we maybe haven’t written down somewhere else.
18.32: And so I see that as super valuable. And that’s the thing that we end up doing—we’ll use an internal product for prototyping to just take that and then have it autogenerated. It takes a little while right now, you know, a couple minutes to do a prototype generation. And so in those cases we’ll just [say], “Here’s what we thought about so far. Just give us a prototype.” And again it doesn’t always do the right thing, but at least it gives us something to now talk about because it’s more real now. It is not the thing that we end up implementing, but it is the thing that we end up talking about.
18.59: By the way, this notion of an agent attending some synchronous meeting, you can imagine taking it to the next level, which is to take advantage of multimodal models. The agent can then absorb speech and maybe visual cues, so then basically when the agent suggests something and someone reacts with a frown. . .
19.25: I think there’s something really interesting about that. And when you talk about multimodal, I do think that one of the things that is really important about human communication is the way that we pick up cues from each other—if we think about it, the reason why we actually talk to each other. . . And there’s a great book called The Enigma of Reason that’s all about this.
But their hypothesis is that, yes, we can try to logic or pretend to logic inside of our own heads, but we actually do a lot of post hoc analysis. So we come up with an idea inside our head. We have some certainty around it, some intuition, and then we fit it to why we thought about this. So that’s what we do internally.
But when you and I are talking, I’m actually trying to read your mind in some way. I’m trying to understand the norms that are at play. And I’m using your facial expression. I’m using your tone of voice. I’m using what you’re saying—actually way less of what you’re saying and more your facial expression and your tone of voice—to determine what’s going on.
20.16: And so I think this idea of engagement with these tools and the way these tools work, I think [of] the idea of gaze tracking: What are people looking at? What are people talking about? How are people reacting to this? And then I think this is where in the future, in some of the early prototypes we built internally for what the synchronous meeting would look like, we have it where the agent is raising its hand and saying, “Here’s an issue that we may want to discuss.” If the people want to discuss it, they can discuss it, or they can ignore it.
20.41: Longer term, we have to start to think about how agents are fitting into the turn-taking of conversation with the rest of the group. And using all of these multimodal cues ends up being very interesting, because you wouldn’t want just an agent whenever it thinks of something to just blurt it out.
20.59: And so there’s a lot of work to do here, but I think there’s something really exciting about just using engagement as the means to understand what the hot topics are, but also trying to help detect “Are we rat-holing on something that should be put in the parking lot?” Those are things and cues that we can start to get from these systems as well.
21.16: By the way, context has multiple dimensions. So you can imagine in a meeting between the two of us, you outrank me. You’re my manager. But then it turns out the agent realizes, “Well, actually, looking through the data in the company, Ben knows more about this topic than Chris. So maybe when I start absorbing their input, I should weigh Ben’s, even though in the org chart Chris outranks Ben.”
21.46: A related story: One of the things I’ve created inside of a Copilot Space is actually a proxy for our CPO. And so what I’ve done is I’ve taken meetings that he’s done where he asked questions in a smaller setting, taken his writing samples and things like that, and I’ve tried to turn it into, not really an agent, but a space where I can say, “Here’s what I’m thinking about for this plan. What would Mario [Rodriguez] potentially think about this?”
It’s definitely not 100% accurate in any way. Mario’s an individual who is constantly changing and learning and has intuitions that he doesn’t say out loud, but it is interesting how it does sound like him. It does seem to focus on questions that he would bring up in a previous meeting based on the context that we provided. And so, to your point, there are a lot of things said inside of meetings right now that we then don’t use to actually help understand people’s points of view in a deeper way.
22.40: You could imagine that this proxy also could be used for [determining] potential blind spots for Mario that, as a person that is working on this, I may need to deal with, in the sense that maybe he’s not always focused on this type of issue, but I think it’s a really big deal. So how do I help him actually understand what’s going on?
22.57: And this gets back to that reporting being at the listener’s ear: What does that person actually care about? What do they need to know about to build trust with the team? What do they need to take action on? Those are areas where I think we can start to build interesting profiles.
There’s a really interesting ethical question, which is: Should that person be able to write their own proxy? Would it include the blind spots that they have or not? And then maybe compare this to—you know, there’s [been] a trend for a little while where every leader would write their own user manual or readme, and inside of those things, they tend to be a bit more performative. It’s more about how they idealize their behavior versus the way that they actually are.
23.37: And so there’s some interesting problems that start to come up when we’re doing proxying. I don’t call it a digital twin of a person, because digital twins to me are basically simulations of mechanical things. But to me it’s “What is this proxy that might sit in this meeting to help give us a perspective and maybe even identify when this is something we should escalate to that person?”
23.55: I think there’s lots of very interesting things. Power structures inside of the organization are really hard to discern because there’s both, to your point, hierarchical ones that are very set in the systems that are there, but there’s also unsaid ones.
I mean, one funny story is Ray Dalio did try to implement this inside of his hedge fund. And unfortunately, I guess, for him, there were two people that were considered to be higher ranking in reputation than him. But then he changed the system so that he was ranked number one. So I guess we have to worry about this type of thing for these proxies as well.
24.27: One of the reasons why coding is such a great playground for these things is, one, you can validate the result. But secondly, the data is quite tame and relatively structured. So you have version control systems like GitHub—you can look through that and say, “Hey, actually Ben’s commits are much more valuable than Chris’s commits.” Or “Ben is the one who suggested all of these changes before, and they were all accepted. So maybe we should really take Ben’s opinion much more strong[ly].” I don’t know what artifacts you have in the product management space that can help develop this reputation score.
25.09: Yeah. It’s tough because a reputation score, especially once you start to monitor some type of metric and it becomes the goal, that’s where we get into problems. For example, Agile teams adopting velocity as a metric: It’s meant to be an internal metric that helps us understand “If this person is out, how does that adjust what type of work we need to do?” But then comparing velocities between different teams ends up creating a whole can of worms around “Is this actually the metric that we’re trying to optimize for?”
25.37: And even when it comes to product management, what I would say is actually valuable a lot of the time is “Does the team understand why they’re working on something? How does it link to the broader strategy? How does this solve both business and customer needs? And then how are we wrangling this uncertainty of the world?”
I would argue that a really key meta skill for product managers—and for other people like generative user researchers, business development people, you know, even leaders inside the organization—they have to deal with a lot of uncertainty. And it’s not that we need to shut down the uncertainty, because actually uncertainty is an advantage that we should take advantage of and something we should use in some way. But there are places where we need to be able to build enough certainty for the team to do their work and then make plans that are resilient in the future uncertainty.
26.24: And then finally, the ability to communicate what the team is doing and why it’s important is very valuable. Unfortunately, there’s not a lot of. . . Maybe there’s rubrics we can build. And that’s actually what career ladders try to do for product managers. But they tend to be very vague actually. And as you get more senior inside of a product manager organization, you start to see things—it’s really just broader views, more complexity. That’s really what we start to judge product managers on. Because of that fact, it’s really about “How are you working across the team?”
26.55: There will be cases, though, that we can start to say, “Is this thing thought out well enough at first, at least for the team to be able to take action?” And then linking that work as a team to outcomes ends up being something that we can apply more and more data rigor to. But I worry about it being “This initiative brief was perfect, and so that meant the success of the product,” when the reality was that was maybe the starting point, but there was all this other stuff that the product manager and the team was doing together. So I’m always wary of that. And that’s where performance management for PMs is actually pretty hard: where you have to base most of your understanding on how they work with the other teammates inside their team.
27.35: You’ve been in product for a long time, so you have a network of peers in other companies, right? What are one or two examples of the use of AI—not in GitHub—in the product management context that you admire?
27.53: For a lot of the people that I know that are inside of startups that are basically using prototyping tools to build out their initial product, I have a lot of, not necessarily envy, but I respect that a lot because you have to be so scrappy inside of a startup, and you’re really there to not only prove something to a customer, or actually not even prove something, but get validation from customers that you’re building the right thing. And so I think that type of rapid prototyping is something that is super valuable for that stage of an organization.
28.26: When I start to then look at larger enterprises, what I do see is that these prototyping tools aren’t as much help with what we’ll call brownfield development: We need to build something on top of this other thing. It’s actually hard to use these tools today to imagine new things inside of a current ecosystem or a current design system.
28.46: [For] a lot of the teams that are in other places, it really is a struggle to get access to some of these tools. The thing that’s holding back the biggest enterprises from actually doing interesting work in this area is they’re overconstraining what their engineers [and] product managers can use as far as these tools.
And so what’s actually being created is shadow systems, where the person is using their personal ChatGPT to actually do the work rather than something that’s within the compliance of the organization.
29.18: Which is great for IP protection.
29.19: Exactly! That’s the problem, right? Some of this stuff, you do want to use the most current tools, because there’s not just [the] time-savings and toil-reduction aspects—there’s also the fact that it helps you think differently, especially if you’re an expert in your domain. It really aids you in becoming even better at what you’re doing. And then it also shores up some of your weaknesses. Those are the things that really expert people are using these types of tools for. But in the end, it comes down to a combination of legal, HR, IT, and budgetary considerations that are holding back some of these organizations.
30.00: When I’m talking to other people inside of the orgs. . . Maybe another problem for enterprises right now is that a lot of these tools require lots of different context. We’ve benefited inside of GitHub in that a lot of our context is inside the GitHub graph, so Copilot can access it and use it. But other teams keep things in all of these individual vendor platforms.
And so the biggest problem then ends up being “How do we merge these different pieces of context in a way that is allowed?” When I first started working on the Synapse team, I looked at the patterns that we were building, and it was like “If we just had access to Zapier or Relay or something like that, that is exactly what we need right now.” Except we would not have any of the approvals for the connectors to all of these different systems. Airtable is a great example of something like that too: They’re building out process automation platforms that focus on data as well as connecting to other data sources, plus the idea of including LLMs as components inside these processes.
30.58: A really big issue I see for enterprises in general is the connectivity issue between all the datasets. And there are, of course, teams that are working on this—Glean or others that are trying to be more of an overall data copilot frontend for your entire enterprise datasets. But I just haven’t seen as much success in getting all these connected.
31.17: I think one of the things that people don’t realize is enterprise search is not turnkey. You have to get in there and really do all these integrations. There are no shortcuts. If a vendor comes to you and says, “Yeah, just use our system and it all magically works,” it doesn’t.
31.37: This is why we need to hire more people with degrees in library science, because they actually know how to manage these types of systems. Again, I first cut my teeth on this in very early versions of SharePoint a long time ago. And even inside there, there’s so much that you need to do to just help people with not only organization of the data but even just the search itself.
It’s not just a search index problem. It’s a bunch of different things. That’s why, whenever we’re shown an empty text box, there’s so much work that goes on just behind it: Inside of Google, with all of the instant answers, there are lots of different ways that a particular search query is actually looked at, not just to go against the search index but to also provide you the right information. And now they’re trying to include Gemini by default in there. The same thing happens within any copilot. There’s a million different things you could use.
32.27: And so I guess maybe this gets to my hypothesis about the way that agents will be valuable, either fully autonomous ones or ones that are attached to a particular process: having many different agents that are highly biased in a particular way. And I use the term bias as in bias can be good, neutral, and bad, right? I don’t mean bias in a way of unfairness and that type of stuff; I mean more from the standpoint of “This agent is meant to represent this viewpoint, and it’s going to give you feedback from this viewpoint.” That ends up becoming really, really valuable because of the fact that you will not always be thinking about everything.
33.00: I’ve done a lot of work in adversarial thinking and red teaming and stuff like that. One of the things that is most valuable is to build prompts that break the sycophancy these different models have by default, because it should be about challenging my thinking rather than just agreeing with it.
And then the standpoint of each one of these highly biased agents actually helps provide a very interesting approach. I mean, if we go to things like meeting facilitation or workshop facilitation groups, this is why. . . I don’t know if you’re familiar with the six hats, but the six hats is a technique by which we declare inside of a meeting that I’m going to be the one that’s all positivity, this person’s going to be the one about data, this person’s going to be the one that’s the adversarial, negative one, etc., etc. When you have all of these different viewpoints, because of the tensions in discussing those ideas, creating options, and weighing options, I think you end up making much better decisions. That’s where I think those highly biased viewpoints end up becoming really valuable.
34.00: For product people who are early in their career or want to enter the field, what are some resources that they should be looking at in terms of leveling up on the use of AI in this context?
34.17: The first thing is there are millions of prompt libraries out there for product managers. What you should do is when you are creating work, you should be using a lot of these prompts to give you feedback, and you can actually even write your own, if you want to. But I would say there’s lots of material out there for “I need to write this thing.”
What’s less common is the pattern of “I try to write it, and then I get critique.” Or the reverse: The AI system, through a prompt, generates a draft of the thing, and then I go in, look at it, and say, “Which things are not actually quite right here?” And I think that, again, those two patterns of getting critique and giving critique end up building a lot of expertise.
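As a rough sketch of that draft-then-critique pattern (not from the episode; the model name, prompts, and helper function are illustrative placeholders), the loop might look something like this:

```python
# A sketch of the "draft, then critique" loop; model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def draft_then_critique(brief: str, model: str = "gpt-4o-mini") -> tuple[str, str]:
    """Generate a first draft, then ask for a skeptical critique of that draft."""
    draft = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a product manager drafting an initiative brief."},
            {"role": "user", "content": f"Write a one-page draft for: {brief}"},
        ],
    ).choices[0].message.content

    critique = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a skeptical reviewer. Do not agree by default; list gaps, risks, and unclear assumptions."},
            {"role": "user", "content": f"Critique this draft:\n\n{draft}"},
        ],
    ).choices[0].message.content

    return draft, critique
```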
34.55: I think also within the organization itself, I believe an awful lot in things that are called basically “learning from your peers.” Being able to join small groups where you are getting feedback from your peers and including AI agent feedback inside of the small peer groups is very valuable.
There’s another technique, which is using case studies. As part of my learning development practice, I actually do something called “decision-forcing cases,” where we take a story that actually happened, walk people through it, and ask them what they think is happening and what they would do next. When you do those types of things across junior and senior people, the junior people can start to actually learn the expertise from the senior people through these types of case studies.
35.37: I think there’s an awful lot more that senior leaders inside the organization should be doing. And as junior people inside your organization, you should be going to these senior leaders and saying, “How do you think about this? What is the way that you make these decisions?” Because what you’re actually pulling from is their past experience and expertise that they’ve gained to build that intuition.
35.53: There’s all sorts of surveys of programmers and engineers and AI. Are there surveys about product managers? Are they freaked out or what? What’s the state of adoption and this kind of thing?
36.00: Almost every PM that I’ve met has used an LLM in some way, to help them with their writing in particular. And if you look at the studies by OpenAI about the use of ChatGPT, a lot of the writing tasks end up being from a product manager or senior leader standpoint. I think people are freaked out because every practice is saying that some other practice is going to be replaced, because an LLM can in some way stand in for that practice’s viewpoint right now.
36.38: I don’t think product management will go away. We may change the terminology that we end up using. But this idea of someone that is helping manage the complexity of the team, help with communication, help with [the] decision-making process inside that team is still very valuable and will be valuable even when we can start to autodraft a PRD.
I would argue that the draft of the PRD is not what matters. It’s actually the discussions that take place in the team after the PRD is created. And I don’t think that designers are going to take over the PM work because, yes, to a certain extent it is about the interaction patterns and the usability of things and the design and the feeling of things. But there are all these other things that you need to worry about when it comes to matching it to business models, matching it to customer mindsets, deciding which problems to solve. That’s what PMs are doing.
37.27: There’s a lot of this concern about [how] every practice is saying this other practice is going to go away because of AI. I just don’t think that’s true. I just think we’re all going to be given different levels of abstraction to gain expertise on. But the core of what we do—an engineer focusing on what is maintainable and buildable and actually something that we want to work on versus the designer that’s building something usable and something that people will feel good using, and a product manager making sure that we’re actually building the thing that is best for the company and the user—those are things that will continue to exist even with these AI tools, prototyping tools, etc.
38.01: And for our listeners, as Chris mentioned, there’s many, many prompt templates for product managers. We’ll try to get Chris to recommend one, and we’ll put it in the episode notes. [See “Resources from Chris” below.] And with that thank you, Chris.
38.18: Thank you very much. Great to be here.
Here’s what Chris shared with us following the recording:
There are two [prompt resources for product managers] that I think people should check out:
However, I’d say that people should take these as a starting point and they should adapt them for their own needs. There is always going to be nuance for their roles, so they should look at how people do the prompting and modify for their own use. I tend to look at other people’s prompts and then write my own.
If they are thinking about using prompts frequently, I’d make a plug for Copilot Spaces to pull that context together.

In this episode, Ben Lorica and Drew Breunig, a strategist at the Overture Maps Foundation, talk all things context engineering: what’s working, where things are breaking down, and what comes next. Listen in to hear why huge context windows aren’t solving the problems we hoped they might, why companies shouldn’t discount evals and testing, and why we’re doing the field a disservice by leaning into marketing and buzzwords rather than trying to leverage what the current crop of LLMs is actually capable of.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O’Reilly learning platform.
This transcript was created with the help of AI and has been lightly edited for clarity.
00.00: All right. So today we have Drew Breunig. He is a strategist at the Overture Maps Foundation. And he’s also in the process of writing a book for O’Reilly called the Context Engineering Handbook. And with that, Drew, welcome to the podcast.
00.23: Thanks, Ben. Thanks for having me on here.
00.26: So context engineering. . . I remember before ChatGPT was even released, someone was talking to me about prompt engineering. I said, “What’s that?” And then of course, fast-forward to today, now people are talking about context engineering. And I guess the short definition is it’s the delicate art and science of filling the context window with just the right information. What’s broken with how teams think about context today?
00.56: I think it’s important to talk about why we need a new word or why a new word makes sense. I was just talking with Mike Taylor, who wrote the prompt engineering book for O’Reilly, exactly about this and why we need a new word. Why is prompt engineering not good enough? And I think it has to do with the way the models, and the way they’re being built, are evolving. I think it also has to do with the way that we’re learning how to use these models.
And so prompt engineering was a natural word to think about when your interaction and how you program the model was maybe one turn of conversation, maybe two, and you might pull in some context to give it examples. You might do some RAG and context augmentation, but you’re working with this one-shot service. And that was really similar to the way people were working in chatbots. And so prompt engineering started to evolve as this thing.
02.00: But as we started to build agents and as companies started to develop models that were capable of multiturn tool-augmented reasoning usage, suddenly you’re not using that one prompt. You have a context that is sometimes being prompted by you, sometimes being modified by your software harness around the model, sometimes being modified by the model itself. And increasingly the model is starting to manage that context. And that prompt is very user-centric. It is a user giving that prompt.
But when we start to have this multiturn, systematic editing and preparation of contexts, a new word is needed, which is this idea of context engineering. This is not to belittle prompt engineering. I think it’s an evolution. And it shows how we’re evolving and finding this space in real time. I think context engineering is more suited to agents and applied AI programming, whereas prompt engineering lives in how people use chatbots, which is a different field. It’s not better and not worse.
And so context engineering is more specific to understanding the failure modes that occur, diagnosing those failure modes and establishing good practices for both preparing your context but also setting up systems that fix and edit your context, if that makes sense.
03.33: Yeah, and also, it seems like the words themselves are indicative of the scope, right? So “prompt” engineering means it’s the prompt. So you’re fiddling with the prompt. And [with] context engineering, “context” can be a lot of things. It could be the information you retrieve. It might involve RAG, so you retrieve information. You put that in the context window.
04.02: Yeah. And people were doing that with prompts too. But I think in the beginning we just didn’t have the words. And that word became a big empty bucket that we filled up. You know, the quote I always quote too often, but I find it fitting, is one of my favorite quotes from Stewart Brand, which is, “If you want to know where the future is being made, follow where the lawyers are congregating and the language is being invented,” and the arrival of context engineering as a word came after the field was invented. It just kind of crystallized and demarcated what people were already doing.
04.36: So the word “context” means you’re providing context. So context could be a tool, right? It could be memory. Whereas the word “prompt” is much more specific.
04.55: And I think it also is like, it has to be edited by a person. I’m a big advocate for not using anthropomorphizing words around large language models. “Prompt” to me involves agency. And so I think it’s nice—it’s a good delineation.
05.14: And then I think one of the very immediate lessons that people realize is, just because. . .
So one of the things that these model providers note when they have a model release is the size of the context window. So people started associating the context window [with] “I stuff as much as I can in there.” But the reality is actually that, one, it’s not efficient. And two, it’s also not useful to the model. Just because you have a massive context window doesn’t mean that the model treats the entire context window evenly.
05.57: Yeah, it doesn’t treat it evenly. And it’s not a one-size-fits-all solution. So I don’t know if you remember last year, but that was the big dream, which was, “Hey, we’re doing all this work with RAG and augmenting our context. But wait a second, if we can make the context 1 million tokens, 2 million tokens, I don’t have to run RAG on all of my corporate documents. I can just fit it all in there, and I can constantly be asking this. And if we can do this, we essentially have solved all of the hard problems that we were worrying about last year.” And so that was the big hope.
And you started to see an arms race of everybody trying to make bigger and bigger context windows to the point where, you know, Llama 4 had its spectacular flameout. It was rushed out the door. But the headline feature by far was “We will be releasing a 10 million token context window.” And the thing that everybody realized is. . . Like, all right, we were really hopeful for that. And then as we started building with these context windows, we started to realize there were some big limitations around them.
07.01: Perhaps the thing that clicked for me was in Google’s Gemini 2.5 paper. Fantastic paper. And one of the reasons I love it is because they dedicate about four pages in the appendix to talking about the kind of methodology and harnesses they built so that they could teach Gemini to play Pokémon: how to connect it to the game, how to actually read out the state of the game, how to make choices about it, what tools they gave it, all of these other things.
And buried in there was a real “warts and all” case study, which are my favorite when you talk about the hard things and especially when you cite the things you can’t overcome. And Gemini 2.5 was a million-token context window with, eventually, 2 million tokens coming. But in this Pokémon thing, they said, “Hey, we actually noticed something, which is once you get to about 200,000 tokens, things start to fall apart, and they fall apart for a host of reasons.” They start to hallucinate. One of the things that is really demonstrable is they start to rely more on the context knowledge than the weights knowledge.
08.22: So inside every model there’s a knowledge base. There’s, you know, all of these other things that get kind of buried into the parameters. But when you reach a certain level of context, it starts to overload the model, and it starts to rely more on the examples in the context. And so this means that you are not taking advantage of the full strength or knowledge of the model.
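A rough illustration of the kind of guardrail this implies: check the running context against a soft token budget (the roughly 200,000-token figure mentioned above) and compact older turns before the model starts to degrade. The helper names and the summarization step here are hypothetical, not any vendor’s actual method:

```python
# Hypothetical guardrail: compact the context before it crosses a soft token budget.
import tiktoken

SOFT_LIMIT = 200_000  # rough inflection point discussed above; tune per model
enc = tiktoken.get_encoding("o200k_base")


def count_tokens(messages: list[dict]) -> int:
    return sum(len(enc.encode(m["content"])) for m in messages)


def compact_if_needed(messages: list[dict], summarize) -> list[dict]:
    """Keep recent turns verbatim; summarize older ones once the budget is near."""
    if count_tokens(messages) < SOFT_LIMIT:
        return messages
    older, recent = messages[:-20], messages[-20:]
    summary = summarize(older)  # e.g., an LLM call that condenses the older turns
    return [{"role": "system", "content": f"Summary of earlier context: {summary}"}] + recent
```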
08.43: So that’s one way it can fail. We call this “context distraction.” Kelly Hong at Chroma has written an incredible paper documenting this, which she calls “context rot,” a similar way [of] charting when these benchmarks start to fall apart.
Now the cool thing about this is that you can actually use this to your advantage. There’s another paper out of, I believe, the Harvard Interaction Lab, where they look at these inflection points for. . .
09.13: Are you familiar with the term “in-context learning”? In-context learning is when you teach the model to do something it doesn’t know how to do by providing examples in your context. And those examples illustrate how it should perform. It’s not something that it’s seen before. It’s not in the weights. It’s a completely unique problem.
Well, sometimes those in-context learning[s] are counter to what the model has learned in the weights. So they end up fighting each other, the weights and the context. And this paper documented that when you get over a certain context length, you can overwhelm the weights and you can force it to listen to your in-context examples.
09.57: And so all of this is just to try to illustrate the complexity of what’s going on here. I think one of the traps that leads us to this place is the gift and the curse of LLMs: We prompt and build contexts in the English language, or whatever language you speak. And so that leads us to believe that they’re going to react like other people or entities that read the English language.
And the fact of the matter is, they don’t—they’re reading it in a very specific way. And that specific way can vary from model to model. And so you have to systematically approach this to understand these nuances, which is where the context management field comes in.
10.35: This is interesting because even before those papers came out, there were studies which showed the exact opposite problem, which is the following: You may have a RAG system that actually retrieves the right information, but then somehow the LLMs can still fail because, as you alluded to, they have weights, so they have prior beliefs. They’ve seen something [on] the internet, and they will opine against the precise information you retrieved into the context.
11.08: This is a really big problem.
11.09: So this is true even if the context window’s small actually.
11.13: Yeah, and Ben, you touched on something that’s really important. So in my original blog post, I document four ways that context fails. I talk about “context poisoning.” That’s when you hallucinate something in a long-running task and it stays in there, and so it’s continually confusing it. “Context distraction,” which is when you overwhelm that soft limit to the context window and then you start to perform poorly. “Context confusion”: This is when you put things that aren’t relevant to the task inside your context, and suddenly the model thinks that it has to pay attention to this stuff, and it leads it astray. And then the last thing is “context clash,” which is when there’s information in the context that’s at odds with the task that you are trying to perform.
A good example of this is, say you’re asking the model to only reply in JSON, but you’re using MCP tools that are defined with XML. And so you’re creating this backwards thing. But I think there’s a fifth piece that I need to write about because it keeps coming up. And it’s exactly what you described.
12.23: Douwe [Kiela] over at Contextual AI refers to this as “context” or “prompt adherence.” But the term that keeps sticking in my mind is this idea of fighting the weights. There’s three situations you get yourself into when you’re interacting with an LLM. The first is when you’re working with the weights. You’re asking it a question that it knows how to answer. It’s seen many examples of that answer. It has it in its knowledge base. It comes back with the weights, and it can give you a phenomenal, detailed answer to that question. That’s what I call “working with the weights.”
The second is what we referred to earlier, which is that in-context learning, which is you’re doing something that it doesn’t know about and you’re showing an example, and then it does it. And this is great. It’s wonderful. We do it all the time.
But then there’s a third example, which is, you’re providing it examples. But those examples are at odds with some things that it learned, usually during posttraining, during the fine-tuning or RL stage. A really good example is output formatting.
13.34: Recently a friend of mine was updating his pipeline to try out a new model from Moonshot. A really great model, and a really great model for tool use. And so he just changed his model and hit run to see what happened. And it kept failing—his thing couldn’t even work. He was like, “I don’t understand. This is supposed to be the best tool use model there is.” And he asked me to look at his code.
I looked at his code and he was extracting data using Markdown, essentially: “Put the final answer in an ASCII box and I’ll extract it that way.” And I said, “If you change this to XML, see what happens. Ask it to respond in XML, use XML as your formatting, and see what happens.” He did that. That one change passed every test. Like basically crushed it because it was working with the weights. He wasn’t fighting the weights. Everyone’s experienced this if you build with AI: the stubborn things it refuses to do, no matter how many times you ask it, including formatting.
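For readers who want the flavor of that fix, here’s a minimal, illustrative version of “working with the weights” on output formats: ask for the answer inside an XML-style tag and extract it, rather than fighting the model over a bespoke ASCII-box format. The tag name and prompt wording are arbitrary:

```python
# Illustrative only: extract a final answer wrapped in an XML-style tag, which many
# models are post-trained to emit reliably, instead of a bespoke ASCII-box format.
import re

PROMPT_SUFFIX = "Return your final answer inside <answer>...</answer> tags and nothing else."


def extract_answer(completion: str) -> str | None:
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else None


print(extract_answer("Sure. <answer>42</answer>"))  # -> "42"
```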
14.35: [Here’s] my favorite example of this though, Ben: So in ChatGPT’s web interface or their application interface, if you go there and you try to prompt an image, a lot of the images that people prompt—and I’ve talked to user researchers about this—are really boring prompts. They have a text box that can be anything, and they’ll say something like “a black cat” or “a statue of a man thinking.”
OpenAI realized this was leading to a lot of bad images because the prompt wasn’t detailed; it wasn’t a good prompt. So they built a system that recognizes if your prompt is too short, low detail, bad, and it hands it to another model and says, “Improve this prompt,” and it improves the prompt for you. And if you open the developer tools in Chrome or Safari or Firefox, whatever, you can see the JSON being passed back and forth, and you can see your original prompt going in. Then you can see the improved prompt.
15.36: My favorite example of this [is] I asked it to make a statue of a man thinking, and it came back and said something like “A detailed statue of a human figure in a thinking pose similar to Rodin’s ‘The Thinker.’ The statue is made of weathered stone sitting on a pedestal. . .” Blah blah blah blah blah blah. A paragraph. . . But below that prompt there were instructions to the chatbot or to the LLM that said, “Generate this image, and after you generate the image, do not reply. Do not ask follow-up questions. Do not make any comments describing what you’ve done. Just generate the image.” And in this prompt, nine times, some of them in all caps, they say, “Please do not reply.” And the reason is because a big chunk of OpenAI’s posttraining is teaching these models how to converse back and forth. They want you to always be asking a follow-up question, and they train the model to do that. And so now they have to fight the prompts. They have to add in all these statements. And that’s another way this fails.
16.42: So why I bring this up—and this is why I need to write about it—is as an applied AI developer, you need to recognize when you’re fighting the prompt, understand enough about the posttraining of that model, or make some assumptions about it, so that you can stop doing that and try something different, because you’re just banging your head against a wall and you’re going to get inconsistent, bad applications and the same statement 20 times over.
17.07: By the way, the other thing that’s interesting about this whole topic is, people have somehow underappreciated or forgotten all of the progress we’ve made in information retrieval. There’s a whole. . . I mean, these people have their own conferences, right? Everything from reranking to the actual indexing, even with vector search—the information retrieval community still has a lot to offer, and it’s the kind of thing that people underappreciate. And so by simply loading your context window with massive amounts of garbage, you’re actually leaving so much of that progress in information retrieval on the field.
18.04: I do think it’s hard. And that’s one of the risks: We’re building all this stuff so fast from the ground up, and there’s a tendency to just throw everything into the biggest model possible and then hope it sorts it out.
I really do think there are two pools of developers. There’s the “throw everything in the model” pool, and then there’s the “I’m going to take incremental steps and find the most optimal model” pool. And I often find that latter group, which I call the compound AI group after a paper that was published out of Berkeley, tends to be people who have run data pipelines, because it’s not just a simple back-and-forth interaction. It’s gigabytes or even more of data you’re processing with the LLM. The costs are high. Latency is important. So designing efficient systems is actually incredibly key, if not a total requirement. So there’s a lot of innovation that comes out of that space because of that kind of boundary.
19.08: If you were to talk to one of these applied AI teams and you were to give them one or two things that they can do right away to improve, or fix context in general, what are some of the best practices?
19.29: Well you’re going to laugh, Ben, because the answer is dependent on the context, and I mean the context in the team and what have you.
19.38: But if you were to just go give a keynote to a general audience, if you were to list down one, two, or three things that are the lowest hanging fruit, so to speak. . .
19.50: The first thing I’m gonna do is I’m going to look in the room and I’m going to look at the titles of all the people in there, and I’m going to see if they have any subject-matter experts or if it’s just a bunch of engineers trying to build something for subject-matter experts. And my first bit of advice is you need to get yourself a subject-matter expert who is looking at the data, helping you with the eval data, and telling you what “good” looks like.
I see a lot of teams that don’t have this, and they end up building fairly brittle prompt systems. And then they can’t iterate well, and so that enterprise AI project fails. I also see them not wanting to open themselves up to subject-matter experts, because they want to hold on to the power themselves. It’s not how they’re used to building.
20.38: I really do think building in applied AI has changed the power dynamic between builders and subject-matter experts. You know, we were talking earlier about some of the old Web 2.0 days, and I’m sure you remember. . . Remember back at the beginning of the iOS app craze, we’d be at a dinner party and someone would find out that you’re capable of building an app, and you would get cornered by some guy who’s like “I’ve got a great idea for an app,” and he would just talk at you—usually a he.
21.15: This is back in the Objective-C days. . .
21.17: Yes, way back when. And this is someone who loves Objective-C. So you’d get cornered and you’d try to find a way out of that awkward conversation. Nowadays, that dynamic has shifted. The subject-matter expertise is so important for codifying and designing the spec, which usually gets specced out by the evals. And you can even see this: OpenAI is arguably at the forefront of this stuff, and what are they doing? They’re standing up programs to get lawyers to come in, to get doctors to come in, to get these specialists to come in and help them create benchmarks, because they can’t do it themselves. So that’s the first thing: You’ve got to work with the subject-matter expert.
22.04: The second thing is if they’re just starting out—and this is going to sound backwards, given our topic today—I would encourage them to use a system like DSPy or GEPA, which are essentially frameworks for building with AI. And one of the components of that framework is that they optimize the prompt for you with the help of an LLM and your eval data.
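A minimal sketch of what that can look like in DSPy, assuming a toy classification task: you declare the task as a signature, provide labeled examples and a metric from your subject-matter experts, and let an optimizer (MIPROv2 here; GEPA is used the same way) tune the prompt for you. The task, data, and model names are illustrative, and exact APIs vary across DSPy versions:

```python
# A minimal DSPy sketch: define the task as a signature, then let an optimizer
# tune the prompt against your eval data. Task, data, and models are illustrative.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))


class TriageTicket(dspy.Signature):
    """Classify a support ticket as one of: billing, bug, feature_request."""
    ticket: str = dspy.InputField()
    label: str = dspy.OutputField()


program = dspy.Predict(TriageTicket)

trainset = [
    dspy.Example(ticket="I was charged twice this month", label="billing").with_inputs("ticket"),
    dspy.Example(ticket="The export button crashes the app", label="bug").with_inputs("ticket"),
    # ...more labeled examples from your subject-matter experts
]


def exact_match(example, prediction, trace=None):
    return example.label == prediction.label


optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)
```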
22.37: Throw in BAML?
22.39: BAML is similar, [but it’s] more like a spec for describing the entire spec. So it’s similar.
22.52: BAML and TextGrad?
22.55: TextGrad is more like the prompt optimization I’m talking about.
22.57: TextGrad plus GEPA plus Regolo?
23.02: Yeah, those things are really important. And the reason I say they’re important is. . .
23.08: I mean, Drew, those are kind of advanced topics.
23.12: I don’t think they’re that advanced. I think they can appear really intimidating because everybody comes in and says, “Well, it’s so easy. I could just write what I want.” And this is the gift and curse of prompts, in my opinion. There’s a lot of things to like about them.
23.33: DSPy is fine, but I think TextGrad, GEPA, and Regolo. . .
23.41: Well. . . I wouldn’t encourage you to use GEPA directly. I would encourage you to use it through the framework of DSPy.
23.48: The point here is if it’s a team building, you can go down essentially two paths. You can handwrite your prompt, and I think this creates some issues. One is, as you build, you tend to have a lot of hotfix statements like, “Oh, there’s a bug over here. We’ll say it over here. Oh, that didn’t fix it. So let’s say it again.” It also encourages you to have one person who really understands this prompt, and so you end up being reliant on this prompt magician. Even though prompts are written in English, there’s kind of no syntax highlighting. They get messier and messier as you build the application because they become these growing collections of edge cases.
24.27: And the other thing too, and this is really important, is when you build and you spend so much time honing a prompt, you’re doing it against one model, and then at some point there’s going to be a better, cheaper, more effective model. And you’re going to have to go through the process of tweaking it and fixing all the bugs again, because this model functions differently.
And I used to have to try to convince people that this was a problem, but they all kind of found out when OpenAI deprecated all of their models and tried to move everyone over to GPT-5. And now I hear about it all the time.
25.03: Although I think right now “agents” is our hot topic, right? So when we talk to people about agents and you start really getting into the weeds, you realize, “Oh, okay. So their agents are really just prompts.”
25.16: In the loop. . .
25.19: So agent optimization in many ways means injecting a bit more software engineering rigor in how you maintain and version. . .
25.30: Because that context is growing. As that loop goes, you’re deciding what gets added to it. And so you have to put guardrails in—ways to rescue from failure and figure out all these things. It’s very difficult. And you have to go at it systematically.
25.46: And then the problem is that, in many situations, the models are not even models that you control, actually. You’re using them through an API like OpenAI or Claude so you don’t actually have access to the weights. So even if you’re one of the super, super advanced teams that can do gradient descent and backprop, you can’t do that. Right? So then, what are your options for being more rigorous in doing optimization?
Well, it’s precisely these tools that Drew alluded to, which is the TextGrads of the world, the GEPA. You have these compound systems that are nondifferentiable. So then how do you actually do optimization in a world where you have things that are not differentiable? Right. So these are precisely the tools that will allow you to turn it from somewhat of a, I guess, black art to something with a little more discipline.
26.53: And I think a good example is, even if you aren’t going to use prompt optimization-type tools. . . The prompt optimization is a great solution for what you just described, which is when you can’t control the weights of the models you’re using. But the other thing too, is, even if you aren’t going to adopt that, you need to get evals because that’s going to be step one for anything, which is you need to start working with subject-matter experts to create evals.
27.22: Because what I see. . . And there was just a really dumb argument online of “Are evals worth it or not?” And it was really silly to me because it was positioned as an either-or argument. And there were people arguing against evals, which is just insane to me. And the reason they were arguing against evals is they’re basically arguing in favor of what they called, to your point about dark arts, vibe shipping—which is they’d make changes, push those changes, and then the person who was also making the changes would go in and type in 12 different things and say, “Yep, feels right to me.” And that’s insane to me.
27.57: And even if you’re doing that—which can work when you have some taste, even if you haven’t gone and created eval coverage. . . I do think when you’re building more qualitative tools. . . A good example is if you’re Character.AI or you’re Portola Labs, who are building essentially personalized emotional chatbots: It’s going to be harder to create evals, and it’s going to require taste as you build them. But having evals is going to ensure that your whole thing doesn’t fall apart because you changed one sentence, which sadly is a risk because this is probabilistic software.
28.33: Honestly, evals are super important. Number one, because leaderboards like LMArena are great for narrowing your options, but at the end of the day, you still need to benchmark all of these against your own application use case and domain. And then secondly, obviously, it’s an ongoing thing. So it ties in with reliability. The more reliable your application is, the more likely you’re doing evals properly in an ongoing fashion. And I really believe that evals and reliability are a moat, because basically what else is your moat? A prompt? That’s not a moat.
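As a bare-bones illustration of the eval loop being argued for here (the data format, scoring rule, and `call_model` hook are placeholders you’d replace with your own pipeline and an SME-built dataset):

```python
# Bare-bones eval harness: run every change against SME-labeled cases and track accuracy.
# `call_model` stands in for whatever prompt/agent pipeline you're shipping.
import json


def run_evals(cases_path: str, call_model) -> float:
    with open(cases_path) as f:
        cases = json.load(f)  # e.g., [{"input": "...", "expected": "..."}, ...]

    passed = 0
    for case in cases:
        output = call_model(case["input"])
        ok = case["expected"].lower() in output.lower()  # crude check; swap in exact match or an LLM judge
        passed += ok
        if not ok:
            print(f"FAIL: {case['input'][:60]!r}")

    accuracy = passed / len(cases)
    print(f"{passed}/{len(cases)} passed ({accuracy:.1%})")
    return accuracy
```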
29.21: So first off, violent agreement there. The only asset teams truly have—unless they’re a model builder, which is only a handful—is their eval data. And I would say the counterpart to that is their spec, whatever defines their program, but mostly the eval data. But to the other point about it, like why are people vibe shipping? I think you can get pretty far with vibe shipping and it fools you into thinking that that’s right.
We saw this pattern in the Web 2.0 and social era, which was, you would have the product genius—everybody wanted to be the Steve Jobs, who didn’t hold focus groups, didn’t ask their customers what they wanted. The Henry Ford quote about “They all say faster horses,” and I’m the genius who comes in and tweaks these things and ships them. And that often takes you very far.
30.13: I also think it’s a bias of success. We only know about the ones that succeed, but the best ones, when they grow up and they start to serve an audience that’s way bigger than what they could hold in their head, they start to grow up with AB testing and ABX testing throughout their organization. And a good example of that is Facebook.
Facebook stopped being driven by just a few people’s choices and started having to do testing and ABX testing in every aspect of their business. Compare that to Snap, which, again, was kind of the last of the great product geniuses to come out. Evan [Spiegel] was heralded as “the product genius,” but I think they ran that too long, and they kept shipping on vibes rather than shipping on ABX testing and growing and, you know, being more boring.
31.04: But again, that’s how you get the global reach. I think there’s a lot of people who probably are really great vibe shippers. And they’re probably having great success doing that. The question is, as their company grows and starts to hit harder times or the growth starts to slow, can that vibe shipping take them over the hump? And I would argue, no, I think you have to grow up and start to have more accountable metrics that, you know, scale to the size of your audience.
31.34: So in closing. . . We talked about prompt engineering. And then we talked about context engineering. So putting you on the spot. What’s a buzzword out there that either irks you or you think is undertalked about at this point? So what’s a buzzword out there, Drew?
31.57: [laughs] I mean, I wish you had given me some time to think about it.
31.58: We are in a hype cycle here. . .
32.02: We’re always in a hype cycle. I don’t like anthropomorphizing LLMs or AI for a whole host of reasons. One, I think it leads to bad understanding and bad mental models, which means that we don’t have substantive conversations about these things, and we don’t learn how to build really well with them, because we think they’re intelligent. We think they’re a PhD in your pocket. We think they’re all of these things and they’re not—they’re fundamentally different.
I’m not against using the way we think the brain works for inspiration. That’s fine with me. But when you start oversimplifying these and not taking the time to explain to your audience how they actually work—you just say it’s a PhD in your pocket, and here’s the benchmark to prove it—you’re misleading and setting unrealistic expectations. And unfortunately, the market rewards them for that. So they keep going.
But I also think it just doesn’t help you build sustainable programs, because you aren’t actually understanding how it works. You’re just kind of reducing it down. AGI is one of those things. And superintelligence, but AGI especially.
33.21: I went to school at UC Santa Cruz, and one of my favorite classes I ever took was a seminar with Donna Haraway. Donna Haraway wrote “A Cyborg Manifesto” in the ’80s. She looks at tech and science history through a feminist lens. You would just sit in that class and your mind would explode, and then at the end, you’d have to sit there for like five minutes afterwards, just picking up the pieces.
She had a great term called “power objects.” A power object is something that we as a society recognize to be incredibly important, believe to be incredibly important, but we don’t know how it works. That lack of understanding allows us to fill this bucket with whatever we want it to be: our hopes, our fears, our dreams. This happened with DNA; this happened with PET scans and brain scans. This happens all throughout science history, down to phrenology and blood types, things that we understood to be, or believed to be, important but aren’t. And big data is another one that is very, very relevant.
34.34: That’s my handle on Twitter.
34.55: Yeah, there you go. So like it’s, you know, I fill it with Ben Lorica. That’s how I fill that power object. But AI is definitely that. AI is definitely that. And my favorite example of this is when the DeepSeek moment happened, we understood this to be really important, but we didn’t understand why it works and how well it worked.
And so what happened is, if you looked at the news and you looked at people’s reactions to what DeepSeek meant, you could basically find all the hopes and dreams about whatever was important to that person. So to AI boosters, DeepSeek proved that LLM progress is not slowing down. To AI skeptics, DeepSeek proved that AI companies have no moat. To open source advocates, it proved open is superior. To AI doomers, it proved that we aren’t being careful enough. Security researchers worried about the risk of backdoors in the models because it was in China. Privacy advocates worried about DeepSeek’s web services collecting sensitive data. China hawks said, “We need more sanctions.” Doves said, “Sanctions don’t work.” NVIDIA bears said, “We’re not going to need any more data centers if it’s going to be this efficient.” And bulls said, “No, we’re going to need tons of them because it’s going to use everything.”
35.44: And AGI is another term like that, which means everything and nothing. And whenever the point comes that we’ve supposedly reached it, it turns out we haven’t. And compounding that is that it’s in the contract between OpenAI and Microsoft—I forget the exact term, but it’s the statement that Microsoft gets access to OpenAI’s technologies until AGI is achieved.
And so it’s a very loaded definition right now that’s being debated back and forth and trying to figure out how to take [Open]AI into being a for-profit corporation. And Microsoft has a lot of leverage because how do you define AGI? Are we going to go to court to define what AGI is? I almost look forward to that.
36.28: So it’s going to be that thing. You’ve seen Sam Altman come out, and some days he talks about how LLMs are just software; some days he talks about how it’s a PhD in your pocket; some days he talks about how we’ve already passed AGI, that it’s already over.
I think Nathan Lambert has some great writing about how AGI is a mistake. We shouldn’t talk about trying to turn LLMs into humans. We should try to leverage what they do now, which is something fundamentally different, and we should keep building and leaning into that rather than trying to make them like us. So AGI is my word for you.
37.03: The way I think of it is, AGI is great for fundraising, let’s put it that way.
37.08: That’s basically it. Well, until you need it to have already been achieved, or until you need it to not be achieved because you don’t want any regulation or if you want regulation—it’s kind of a fuzzy word. And that has some really good properties.
37.23: So I’ll close by throwing in my own term. So prompt engineering, context engineering. . . I will close by saying pay attention to this boring term that my friend Ion Stoica is now talking more about: “systems engineering.” If you look particularly at the agentic applications, you’re talking about systems.
37.55: Can I add one thing to this? Violent agreement. I think that is an underrated. . .
38.00: Although I think it’s too boring a term, Drew, to take off.
38.03: That’s fine! The reason I like it is because—and you were talking about this when you talk about fine-tuning—looking at the way people build, and looking at the way I see successful teams build, there’s pretraining, where you’re basically training on unstructured data and you’re just building your base knowledge, your base English capabilities and all that. And then you have posttraining. And in general, posttraining is where you build. I do think of it as a form of interface design, even though you are adding new skills: You’re teaching reasoning, you’re teaching it validated skills like code and math, you’re teaching it how to chat with you. This is where it learns to converse. You’re teaching it how to use tools and specific sets of tools. And then you’re teaching it alignment, what’s safe, what’s not safe, all these other things.
But then after it ships, you can still RL that model, you can still fine-tune that model, you can still prompt engineer that model, and you can still context engineer that model. And back to the systems engineering thing: I think we’re going to see that span from posttraining all the way through to a final applied AI product. That’s going to be a real shades-of-gray gradient. And this is one of the reasons why I think open models have a pretty big advantage in the future: You’re going to dip down all the way throughout that and leverage that. . .
39.32: The only thing that’s keeping us from doing that now is we don’t have the tools and the operating system to align everything from that posttraining through to shipping. Once we do, that operating system is going to change how we build, because the distance between posttraining and building is going to look really, really, really blurry. I really like the systems engineering type of approach, but I also think you could start to see this yesterday, [when] Thinking Machines released their first product.
40.04: And so Thinking Machines is Mira [Murati]. Her very hyped thing. They launched their first thing, and it’s called Tinker. And it’s essentially, “Hey, you can write very simple Python code, and then we will do the RL for you or the fine-tuning for you using our cluster of GPUs, so you don’t have to manage that.” And that is the type of thing that we want to see in a maturing kind of development framework. And you start to see this operating system emerging.
And it reminds me of the early days of O’Reilly, where it’s like I had to stand up a web server, I had to maintain a web server, I had to do all of these things, and now I don’t have to. I can spin up a Docker image, I can ship to Render, I can ship to Vercel. All of these shared complicated things now have frameworks and tooling, and I think we’re going to see a similar evolution from that. And I’m really excited. And I think you have picked a great underrated term.
40.56: Now with that. Thank you, Drew.
40.58: Awesome. Thank you for having me, Ben.

In this episode, Ben Lorica and AI engineer Faye Zhang talk about discoverability: how to use AI to build search and recommendation engines that actually find what you want. Listen in to learn how AI goes way beyond simple collaborative filtering—pulling in many different kinds of data and metadata, including images and voice, to get a much better picture of what any object is and whether or not it’s something the user would want.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O’Reilly learning platform.
This transcript was created with the help of AI and has been lightly edited for clarity.
0:00: Today we have Faye Zhang of Pinterest, where she’s a staff AI engineer. And so with that, Faye, welcome to the podcast.
0:14: Thanks, Ben. Huge fan of the work. I’ve been fortunate to attend both the Ray and NLP Summits, where I know you serve as chair. I also love the O’Reilly AI podcast. The recent episode on A2A and the one with Raiza Martin on NotebookLM have been really inspirational. So, great to be here.
0:33: All right, so let’s jump right in. So one of the first things I really wanted to talk to you about is this work around PinLanding. And you’ve published papers, but I guess at a high level, Faye, maybe describe for our listeners: What problem is PinLanding trying to address?
0:53: Yeah, that’s a great question. In short, we’re trying to solve this trillion-dollar discovery crisis. We’re living through the greatest paradox of the digital economy: Essentially, there’s infinite inventory but very little discoverability. Picture one example: A bride-to-be asks ChatGPT, “Now, find me a wedding dress for an Italian summer vineyard ceremony,” and she gets great general advice. But meanwhile, somewhere in Nordstrom’s hundreds of catalogs, there sits the perfect terracotta Soul Committee dress, never to be found. And that’s a $1,000 sale that will never happen. And if you multiply this by a billion searches across Google, SearchGPT, and Perplexity, we’re talking about a $6.5 trillion market, according to Shopify’s projections, where every failed product discovery is money left on the table. So that’s what we’re trying to solve: essentially the semantic organization of platform content to match user context and search.
2:05: So, before PinLanding was developed, and if you look across the industry and other companies, what would be the default—what would be the incumbent system? And what would be insufficient about this incumbent system?
2:22: There have been researchers across the past decade working on this problem; we’re definitely not the first one. I think number one is to understand the catalog attribution. So, back in the day, there was multitask R-CNN generation, as we remember, [that could] identify fashion shopping attributes. So you would pass an image into the system, and it would identify, okay: This shirt is red, and the material may be silk. And then, in recent years, because we can leverage large-scale VLMs (vision language models), this problem has become much easier.
3:03: And then I think the second route that people come in is via the content organization itself. Back in the day, [there was] research on joint graph modeling of shared attribute similarity. And a lot of ecommerce stores also do, “Hey, if people like this, you might also like that,” and that relationship graph gets captured in their organization tree as well. We utilize a vision large language model and then the foundation model CLIP by OpenAI to easily recognize what this content or piece of clothing could be for. And then we use LLMs to discover all the possibilities—scenarios, use cases, price points—to connect the two worlds together.
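A toy sketch of that CLIP step, scoring a product image against candidate use-case labels; the checkpoint and labels are illustrative, and the actual PinLanding pipeline is certainly more involved:

```python
# Sketch: score a product image against candidate use-case labels with CLIP.
# Model checkpoint and label set are illustrative.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["summer vineyard wedding dress", "office blazer", "beach cover-up", "winter coat"]
image = Image.open("product.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2f}")
```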
3:55: To me that implies you have some rigorous eval process or even a separate team doing eval. Can you describe to us at a high level what is eval like for a system like this?
4:11: Definitely. I think there are internal and external benchmarks. For the external ones, it’s Fashion200K, which is a public benchmark anyone can download from Hugging Face, a standard for how accurate your model is at predicting fashion items. So we measure the performance using recall top-k metrics, which check whether the correct label appears among the top-k predicted attributes, and as a result, we were able to see 99.7% recall for the top ten.
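For reference, recall@k here just measures how often the true attribute shows up in the model’s top-k predictions; a tiny illustration with made-up data:

```python
# Recall@k: fraction of items whose true attribute appears in the model's top-k predictions.
# Data is made up for illustration.
def recall_at_k(predictions: list[list[str]], labels: list[str], k: int = 10) -> float:
    hits = sum(label in preds[:k] for preds, label in zip(predictions, labels))
    return hits / len(labels)


preds = [["red", "silk", "dress"], ["blue", "cotton", "shirt"]]
labels = ["silk", "linen"]
print(recall_at_k(preds, labels, k=3))  # 0.5
```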
4:47: The other topic I wanted to talk to you about is recommendation systems. So obviously there’s now talk about, “Hey, maybe we can go beyond correlation and go towards reasoning.” Can you [tell] our audience, who may not be steeped in state-of-the-art recommendation systems, how you would describe the state of recommenders these days?
5:23: Over the past decade, [we’ve been] seeing tremendous movement—foundational shifts in how RecSys essentially operates. Just to call out a few big themes I’m seeing across the board: Number one, it’s moving from correlation to causation. Back then it was, hey, a user who likes X might also like Y. But now we actually understand why content is connected semantically, and our LLM-based models are able to reason about what user preferences actually are.

5:58: The second big theme is probably the cold-start problem, where companies leverage semantic IDs to handle new items by encoding and understanding the content directly. For example, if this is a dress, then you understand its color, style, theme, etc. [A sketch of the semantic ID idea appears after this answer.]

6:17: Another big theme we’re seeing: Netflix, for example, is moving from isolated systems toward a unified intelligence. Just this past year, Netflix [updated] their multitask architecture with shared representations into one system they call UniCoRn, to enable company-wide improvements [and] optimizations.

6:44: And lastly, on the frontier side—this is actually what I learned at the AI Engineer Summit from YouTube. It’s a DeepMind collaboration, where YouTube is now using a large recommendation model, essentially teaching Gemini to speak the language of YouTube: hey, a user watched this video, so what might [they] watch next? So a lot of very exciting capabilities are happening across the board for sure.
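As referenced above, here is a toy sketch of the “semantic ID” idea: turn a content embedding into a short sequence of discrete codes so that brand-new items get usable identifiers on day one. Production systems (for example, learned RQ-VAE codebooks) are far more sophisticated; the embeddings, codebook sizes, and number of levels below are assumptions made for illustration.

```python
# A toy sketch of semantic IDs via residual k-means quantization (real systems
# often learn their codebooks). Embeddings and codebook sizes are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(1000, 64))  # stand-in for content embeddings

def fit_residual_codebooks(X, levels=3, codes_per_level=64):
    """Fit one k-means codebook per level on the residual left by previous levels."""
    codebooks, residual = [], X.copy()
    for _ in range(levels):
        km = KMeans(n_clusters=codes_per_level, n_init=10, random_state=0).fit(residual)
        codebooks.append(km)
        residual = residual - km.cluster_centers_[km.labels_]
    return codebooks

def semantic_id(embedding, codebooks):
    """Map a single content embedding to a tuple of discrete codes."""
    codes, residual = [], embedding.copy()
    for km in codebooks:
        code = int(km.predict(residual.reshape(1, -1))[0])
        codes.append(code)
        residual = residual - km.cluster_centers_[code]
    return tuple(codes)  # e.g. (17, 3, 42) serves as the item's semantic ID

codebooks = fit_residual_codebooks(item_embeddings)
print(semantic_id(item_embeddings[0], codebooks))
```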
7:15: Generally it sounds like the themes from years past still map over in the following sense, right? So there’s content—the difference being now you have these foundation models that can understand the content that you have more granularly. It can go deep into the videos and understand, hey, this video is similar to this video. And then the other source of signal is behavior. So those are still the two main buckets?
7:53: Correct. Yes, I would say so.
7:55: And so the foundation models help you on the content side but not necessarily on the behavior side?
8:03: I think it depends on how you want to see it. For example, on the embedding side—which is a kind of representation of a user entity—there have been transformations since the days of the BERT transformer. Now we have long-context encapsulation, and that’s all with the help of LLMs. So we can better understand users—not just their next or last clicks, but “hey, [in the] next 30 days, what might this user like?”
8:31: I’m not sure this is happening, so correct me if I’m wrong. The other thing that I would imagine that the foundation models can help with is, I think for some of these systems—like YouTube, for example, or maybe Netflix is a better example—thumbnails are important, right? The fact now that you have these models that can generate multiple variants of a thumbnail on the fly means you can run more experiments to figure out user preferences and user tastes, correct?
9:05: Yes. I would say so. I was lucky enough to be invited to one of the engineer network dinners, [and was] speaking with the engineer who actually works on the thumbnails. Apparently it was all personalized, and the approach you mentioned enabled their rapid iteration of experiments, and had definitely yielded very positive results for them.
9:29: For the listeners who don’t work on recommendation systems, what are some general lessons from recommendation systems that generally map to other forms of ML and AI applications?
9:44: Yeah, that’s a great question. A lot of the concepts still apply. For example, the knowledge distillation. I know Indeed was trying to tackle this.
9:56: Maybe Faye, first define what you mean by that, in case listeners don’t know what that is.
10:02: Yes. So knowledge distillation is essentially, in a modeling sense, learning from a parent model—one with more parameters and better world knowledge (and the same goes for ML systems generally)—and distilling that into smaller models that can operate much faster but still, hopefully, encapsulate the learning from the parent model.

10:24: So I think what Indeed faced back then was the classic precision-versus-recall problem in production ML. Their binary classifier needs to filter the batch of jobs that get recommended to candidates. But this process is obviously very noisy, the training data is sparse, and there are latency constraints as well. In the work they published, they couldn’t really get Mistral or maybe Llama 2 to separate résumé content effectively. And then they were happy to learn [that] out-of-the-box GPT-4 achieved something like 90% precision and recall. But obviously GPT-4 is more expensive and has close to 30 seconds of inference time, which is much slower.

11:21: So what they did was use the distillation concept: fine-tune GPT-3.5 on labeled data and then distill it into a lightweight BERT-based model using a temperature-scaled softmax, and they were able to achieve millisecond latency with a comparable recall-precision trade-off. So I think that’s one of the learnings we see across the industry: Traditional ML techniques still work in the age of AI. And I think we’re going to see a lot more of that in production work as well.
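Here is a minimal PyTorch sketch of distillation with a temperature-scaled softmax, in the spirit of the Indeed example Faye describes (teacher = a larger fine-tuned model’s scores, student = a small BERT-style classifier). The shapes, temperature, and loss weighting are illustrative assumptions, not Indeed’s actual setup.

```python
# A hedged PyTorch sketch of temperature-scaled distillation; shapes, T, and
# alpha are illustrative, not Indeed's actual configuration.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the labeled data.
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a batch of 4 examples for a binary "recommend this job?" decision.
student_logits = torch.randn(4, 2, requires_grad=True)  # small BERT-style student
teacher_logits = torch.randn(4, 2)                      # frozen teacher scores
labels = torch.tensor([1, 0, 1, 1])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(loss.item())
```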
11:57: By the way, one of the underappreciated things in the recommendation system space is actually UX, in some ways, right? Because good UX for delivering recommendations can move the needle. How you present your recommendations might make a material difference.
12:24: I think that’s very much true, although I can’t claim to be an expert on it. I know most recommendation systems deal with monetization, so it’s tricky to weigh, “Hey, what does my user click on, engage with, send via social—versus what percentage of that…
12:42: And it’s also very platform specific. So you can imagine TikTok as one single feed—the recommendation is just on the feed. But YouTube is, you know, the stuff on the side or whatever. And then Amazon is something else. Spotify and Apple [too]. Apple Podcast is something else. But in each case, I think those of us on the outside underappreciate how much these companies invest in the actual interface.
13:18: Yes. And I think there are multiple iterations happening on any given day, [so] you might see a different interface than your friends or family because you’re actually being grouped into A/B tests. So I think it’s very much true that the engagement and performance of the UX have an impact on a lot of the search/rec systems as well, beyond the data we just talked about.
13:41: Which brings to mind another topic I’ve been interested in over many, many years, which is this notion of experimentation. Many of the most successful companies in the space have invested in experimentation tools and platforms, where people can run experiments at scale, run them much more easily, and monitor them in a much more principled way, so that anything they do is backed by data. So I think companies underappreciate the importance of investing in such a platform.
14:28: I think that’s very much true. A lot of larger companies actually build their own in-house A/B testing or experimentation frameworks. Meta does; Google has their own; and even different cohorts of products—monetization, social. . .—have their own niche experimentation platforms. So I think that thesis is very much true.
14:51: The last topic I wanted to talk to you about is context engineering. I’ve talked to numerous people about this. So every six months, the context window for these large language models expands. But obviously you can’t just stuff the context window full, because one, it’s inefficient. And two, actually, the LLM can still make mistakes because it’s not going to efficiently process that entire context window anyway. So talk to our listeners about this emerging area called context engineering. And how is that playing out in your own work?
15:38: I think this is a fascinating topic, where you’ll hear people passionately say, “RAG is dead.” And it’s really, as you mentioned, [that] context windows have gotten much, much bigger. For example, back in April, Llama 4 had this staggering 10 million-token context window. So the logic behind the argument is quite simple: If the model can indeed handle millions of tokens, why not just dump everything in instead of doing retrieval?

16:08: I think there are quite a few fundamental limitations to this. I know folks from Contextual AI are passionate about this. Number one is scalability. A lot of times in production, at least, your knowledge base is measured in terabytes or petabytes, not tokens—something far larger. And number two, I think, would be accuracy.

16:33: The effective context window is, honestly, very different from what’s advertised in product launches. We see performance degrade long before the model reaches its “official limits.” And then number three is probably efficiency, which aligns with our human behavior as well: Do you read an entire book every time you need to answer one simple question? So I think context engineering has slowly evolved from a buzzword a few years ago into an engineering discipline.
17:15: I’m appreciative that the context windows are increasing. But at some level, I also recognize that, to some extent, it’s kind of a feel-good move on the part of the model builders. It makes us feel good that we can put more things in there, but it may not actually help us answer the question precisely. Actually, a few years ago, I wrote kind of a tongue-in-cheek post called “Structure Is All You Need.” So basically, whatever structure you have, you should use it to help the model, right? If the data is in a SQL database, then maybe you can expose the structure of that data. If it’s a knowledge graph, you leverage whatever structure you have to give the model better context. So this whole notion of just stuffing the model with as much information as possible—your objections to it, for all the reasons you gave, are valid. But also, philosophically, it doesn’t make sense to do that anyway.
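As a small illustration of the “expose the structure” point above, the sketch below pulls a schema out of SQLite and places it, rather than raw rows, into the prompt. The tables and prompt wording are invented for this example.

```python
# A hedged sketch of "structure is all you need": give the model the schema,
# not the raw rows. Tables and prompt wording are invented for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT, color TEXT, price REAL);
CREATE TABLE orders (id INTEGER PRIMARY KEY, product_id INTEGER, user_id INTEGER, ts TEXT);
""")

# Pull CREATE statements straight from SQLite's catalog instead of dumping data.
schema = "\n".join(
    row[0] for row in conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'")
)

prompt = (
    "You can query a SQLite database with the following schema:\n"
    f"{schema}\n\n"
    "Write a SQL query that answers: which product colors sold best last month?"
)
print(prompt)  # this compact, structured context is what goes to the model
```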
18:30: What are the things that you are looking forward to, Faye, in terms of foundation models? What kinds of developments in the foundation model space are you hoping for? And are there any developments that you think are below the radar?
18:52: I think, to better utilize the concept of “context engineering,” there are essentially two loops. Number one is the inner loop: what happens within the LLM itself. And then there’s the outer loop: What can you do as an engineer to optimize a given context window to get the best results out of the product? Within the context loop there are multiple tricks we can use: For example, there’s vector [search] plus Excel or regex extraction, and there are metadata filters. And then for the outer loop—this is a very common practice—people are using LLMs as rerankers, sometimes cross-encoders. The thesis is, hey, why would you overburden an LLM with ranking 20,000 items when there are things you can do to reduce that to the top hundred or so? So all of this—context assembly, deduplication, and diversification—helps our production [go] from a prototype to something [that’s] more real time, reliable, and able to scale.
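The “outer loop” Faye describes—narrowing thousands of candidates with cheap retrieval, then reranking a shortlist before anything reaches the context window—can be sketched as follows. The model names and tiny corpus are assumptions chosen for illustration (a sentence-transformers bi-encoder for retrieval, a cross-encoder for reranking), not a description of any specific production system.

```python
# A hedged sketch of retrieve-then-rerank; model names and the tiny corpus are
# assumptions, and a real corpus would hold thousands of documents.
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

corpus = ["doc one ...", "doc two ...", "doc three ..."]  # imagine ~20,000 docs
query = "wedding dress for an Italian summer vineyard ceremony"

# Stage 1: cheap bi-encoder retrieval narrows the candidate set (top 100 in practice).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(corpus, normalize_embeddings=True)
query_vec = embedder.encode(query, normalize_embeddings=True)
shortlist = np.argsort(-(doc_vecs @ query_vec))[:100]

# Stage 2: a cross-encoder reranks the shortlist with full query-document attention.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, corpus[i]) for i in shortlist])

# Only the handful of top-ranked documents ever reach the LLM's context window.
for i in np.argsort(-scores)[:5]:
    print(f"{scores[i]:.3f}  {corpus[shortlist[i]]}")
```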
20:07: One of the things I wish for—and I don’t know, this is wishful thinking—is for the models to be a little more predictable. By that I mean, if I ask a question in two different ways, it’ll basically give me the same answer. The foundation model builders could somehow increase predictability and maybe provide us with a little more explanation for how they arrive at an answer. I understand they’re giving us the tokens, and maybe some of the reasoning models are a little more transparent, but give us an idea of how these things work, because it’ll impact what kinds of applications we’d be comfortable deploying these things in. For example, agents: If I’m using an agent to use a bunch of tools but I can’t really predict its behavior, that impacts the types of applications I’d be comfortable using a model for.
21:18: Yeah, definitely. I very much resonate with this, especially now that most engineers have, you know, AI-empowered coding tools like Cursor and Windsurf. As an individual, I very much appreciate the train of thought you mentioned: why an agent does certain things, why it’s navigating between repositories, what it’s looking at while it makes this call. I think those are very much appreciated. I know there are other approaches—look at Devin, the fully autonomous engineering peer: It just takes a task, and you don’t know where it goes. But I think in the near future there will be a nice marriage between the two—well, especially now that Windsurf is part of Devin’s parent company.
22:05: And with that, thank you, Faye.
22:08: Awesome. Thank you, Ben.

Join Luke Wroblewski and Ben Lorica as they talk about the future of software development. What happens when we have databases that are designed to interact with agents and language models rather than humans? We’re starting to see what that world will look like. It’s an exciting time to be a software developer.

Everyone is talking about agents: single agents and, increasingly, multi-agent systems. What kind of applications will we build with agents, and how will we build with them? How will agents communicate with each other effectively? Why do we need a protocol like A2A to specify how they communicate? Join Ben Lorica as he talks with Heiko Hotz and Sokratis Kartakis about A2A and our agentic future.

Terry's Inspired Eggless Cake with orange, chocolate, strawberries & cream ... a celebration of flavours to begin the year and possibly one of my most delicious eggless layer cakes to date. Winter is a charm for baking a layer cake since everything sets well, holds up beautifully and behaves as expected. This orange cake was nothing short of amazing, every layer in perfect harmony.
The post Terry’s Inspired Chocolate Orange Eggless Cake … a celebration of flavours appeared first on Passionate About Baking.
Forget the hounds. Police in China are releasing the squirrels.
Law enforcement in the city of Chongqing reportedly announced that it is training a team of drug-sniffing squirrels to help locate illicit substances and contraband.
Insider reports that the police dog brigade in the city, located in southwestern China, “now has a team of six red squirrels to help them sniff out drugs in the nooks and crannies of warehouses and storage units.”
According to Insider, “Chongqing police told the state-linked media outlet The Paper that these squirrels are small and agile, and able to search through tiny spaces in warehouses and storage units that dogs cannot reach,” and that the “squirrels have been trained to use their claws to scratch boxes in order to alert their handlers if they detect drugs, the police said.”
“Squirrels have a very good sense of smell. However, it’s less mature for us to train rodents for drug search in the past in terms of the technology,” said Yin Jin, a handler with the police dog brigade of the Hechuan Public Security Bureau in Chongqing, as quoted by the Chinese state-affiliated English newspaper Global Times.
“Our self-developed training system can be applied to the training of various animals,” Yin added.
The newspaper noted that in contrast to drug dogs, “squirrels are small and agile, which makes them good at searching high places for drugs.”
According to Insider, “China’s drug-sniffing squirrels may well be the first of their kind,” although “animals and insects other than dogs have also been used to detect dangerous substances like explosives.”
“In 2002, the Pentagon backed a project to use bees to detect bombs. Meanwhile, Cambodia has deployed trained rats to help bomb-disposal squads trawl minefields for buried explosives,” Insider reported. “It is unclear if the Chongqing police intends to expand its force of drug-sniffing squirrels. It is also unclear how often the squirrel squad will be deployed.”
China is known for its strict and punitive anti-drug laws.
According to the publication Health and Human Rights Journal, “drug use [in China] is an administrative and not criminal offense; however, individuals detained by public security authorities are subject to coercive or compulsory ‘treatment.’”
The journal explains: “This approach has been subject to widespread condemnation, including repeated calls over the past decade by United Nations (UN) agencies, UN human rights experts, and human rights organizations for the country to close compulsory drug detention centers and increase voluntary, community-based alternatives. Nonetheless, between 2012 and 2018, the number of people in compulsory drug detention centers in China remained virtually unchanged, and the number enrolled in compulsory community-based treatment rose sharply.”
“In addition to these approaches, the government enters all people detained by public security authorities for drug use in China into a system called the Drug User Internet Dynamic Control and Early Warning System, or Dynamic Control System (DCS),” the journal continues. “This is a reporting and monitoring system launched by the Ministry of Public Security in 2006. Individuals are entered into the system regardless of whether they are dependent on drugs or subject to criminal or administrative detention; some individuals who may be stopped by public security but not formally detained may also be enrolled in the DCS”
The Dynamic Control System “acts as an extension of China’s drug control efforts by monitoring the movement of people in the system and alerting police when individuals, for example, use their identity documents when registering at a hotel, conducting business at a government office or bank, registering a mobile phone, applying for tertiary education, or traveling,” according to the journal.
The post Chinese Police Enlist Drug-Sniffing Squirrels appeared first on High Times.
Officials in New Zealand announced this week that they have completed a massive seizure of cocaine at sea, calling it a “major financial blow” to producers and traffickers of the drug.
Authorities there said on Wednesday that the seizure was a part of “Operation Hydros,” with New Zealand Police working in partnership with both New Zealand Customs Service and the New Zealand Defence Force.
The announcement said that “no arrests have been made at this stage,” but that “enquiries will continue into the shipment including liaison with our international partners.”
Members of those units intercepted “3.2 tonnes of cocaine afloat” in the Pacific Ocean. NZ Customs Service Acting Comptroller Bill Perry said that the “sheer scale of this seizure is estimated to have taken more than half a billion dollars’ worth of cocaine out of circulation.”
(The news agency United Press International described the seizure as a “3.5 ton haul of cocaine with a street value of $317 million in a major anti-drugs operation carried out in the middle of the Pacific.”)

“Customs is pleased to have helped prevent such a large amount of cocaine causing harm in communities here in New Zealand, Australia and elsewhere in the wider Pacific region,” Perry said. “It is a huge illustration of what lengths organised crime will go to with their global drug trafficking operations and shows that we are not exempt from major organised criminal drug smuggling efforts in this part of the world.”
NZ Police Commissioner Andrew Coster called it “one of the single biggest seizures of illegal drugs by authorities in this country.”
“There is no doubt this discovery lands a major financial blow right from the South American producers through to the distributors of this product,” Coster said.
Coster added, “While this disrupts the syndicate’s operations, we remain vigilant given the lengths we know these groups will go to circumvent coming to law enforcement’s attention.”
The authorities said in the announcement on Wednesday that “eighty-one bales of the product have since made the six-day journey back to New Zealand aboard the Royal New Zealand Navy vessel HMNZS Manawanui, where they will now be destroyed.”
It is believed that “given the large size of the shipment it will have likely been destined for the Australian market,” according to the announcement.
Coster said that Operation Hydros “was initiated in December 2022, as part of our ongoing close working relationship with international partner agencies to identify and monitor suspicious vessels’ movements.”

“I am incredibly proud of what our National Organised Crime Group has achieved in working with other New Zealand agencies, including New Zealand Customs Service and the New Zealand Defence Force. The significance of this recovery and its impact cannot be underestimated,” Coster said.
“We know the distribution of any illicit drug causes a great amount of social harm as well as negative health and financial implications for communities, especially drug users and their families,” Coster added.
The announcement said that Coster noted that the “operation continues already successful work New Zealand authorities are achieving in working together and continues to lessen the impacts of transnational crime worldwide.”
New Zealand Defence Force Joint Forces commander Rear Admiral Jim Gilmour said that his unit “had the right people and the right capabilities to provide the support required and it was great to work alongside the New Zealand Police and the New Zealand Customs Service.”
“We were very pleased with the result and are happy to be a part of this successful operation and proud to play our part in protecting New Zealand,” Gilmour said.
The post New Zealand Officials Seize Half a Billion Dollars Worth of Cocaine appeared first on High Times.
Welcome to the original pumpkin record page, the official page identifying the heaviest pumpkins in the world by country. This is the current 2022 list of the top 10 largest pumpkins grown to date. The majority of these pumpkins were grown in Europe and the UK; only 4 were grown in the USA. Europe […]
The post Top 10 Largest Pumpkins in the World & Country Records appeared first on Backyard Gardener.
Argentina officially launched a new government agency on Wednesday as part of an effort to bolster the country’s medical marijuana and hemp industry.
Reuters reports that the agency, known as the Regulatory Agency for the Hemp and Medicinal Cannabis Industry, or ARICCAME, represents “the first working group of a new national agency to regularize and promote the country’s nascent cannabis industry, which ministers hope will create new jobs and exports generating fresh income for the South American nation.”
“This opens the door for Argentina to start a new path in terms of industrial exports, on the basis of huge global demand,” said Argentina’s economy minister Sergio Massa at an event marking the launch of the new agency.
According to Reuters, “Massa said that the agency would from Thursday begin regularizing programs and coordinating with various provinces and [the] industrial sector, adding Argentina already counted on demand for projects linked to the agro-industrial sector.”
On the official website for ARICCAME, the agency outlines its mission and objectives.
“We are the Agency that regulates the import, export, cultivation, industrial production, manufacture, commercialization and acquisition, by any title, of seeds of the cannabis plant, cannabis and its derivative products for medicinal or industrial purposes,” the website reads, via an English translation.
The website lists the following “general objectives” for the agency: “Establish through the respective regulations, the regulatory framework for the entire production chain and national marketing and/or export of the Cannabis Sativa L. plant, seeds and derivatives for use in favor of health and industrial hemp; Promote a new agro-industrial productive sector for the commercial manufacture of medicines, phytotherapeutics, food and cosmetics for human use, medicines and food for veterinary use, as well as the different products made possible by industrial hemp; Generate the framework for the adaptation to the regulatory regime, of the cultivation and production of cannabis derivatives for use in existing health, guaranteeing the traceability and quality of the products in order to safeguard the right to health of the users of medical cannabis; Reintroduce hemp in Argentina and all its derivatives: food, construction materials, textile fiber, cellulose and bioplastics with low environmental impact; [and] Promote scientific research and sectoral technological progress, promoting favorable conditions for these existing industries in our country.”
ARICCAME’s specific objectives include: “Establish clear rules that provide legal certainty to the sector and encourage federal participation; Articulate through agreements and conventions with other State entities with intervention in the matter: INASE, SENASA, INTA, INTI, AFIP, INAES, BCRA, UIF, National Universities, etc; Determine the system of licenses and administrative authorizations for the productive chain; Generate quality standards that safeguard the right to health of users and consumers of cannabis/hemp products; [and] Control non-compliance with the regulatory regime.”
Argentine policymakers legalized cannabis oil for medical use in 2017. Three years later, the country legalized home cannabis cultivation for medical marijuana patients.
The launch of the new agency is part of a broader effort by the Argentine government to continue to reform the medical cannabis program, something that the South American country identified as a priority last year.
According to Reuters, the newly launched agency will be helmed by Francisco Echarren, who “said the industry could generate thousands of new jobs, as well as create technological developments and new products for export.”
“We have a huge challenge ahead of us,” Echarren said, as quoted by Reuters, “not only getting a new industry on its feet, but giving millions of Argentines access to products that improve quality of life.”
The post Argentina Launches New Agency To Boost Cannabis Industry appeared first on High Times.
Topic: Biggest Dark Web Scams of the World
The dark web, as you may have heard, is a dangerous place. There are drug sellers, criminals for hire, credit card fraudsters, and many other illegal businesses hosting services there, and some hackers can ruin your hard-earned reputation by compromising your computer. The smart move is to stay clear of the place entirely. But many people do go there to engage in illegal activities, or worse. Below are a few dark net scams that have been exposed. This article is intended to show how nefarious the dark web is and why it is so important to stay away from it.

The first is the hitman-for-hire scam. These sites pose as murder-for-hire services on the dark web, collecting target information and bitcoin payments from clients who want a specific person killed. In reality there are no hitmen: no well-dressed assassins fitting silencers onto their guns, no operators assigned, and no capos commanding them.

What does exist is a series of websites where all of those things are manufactured to appear real, and some people believe it. They install Tor—a browser that uses encryption and a complicated relaying system to maintain anonymity and reach the dark web, where such sites live—in search of a hitman.
Users of the website fill out a form under bogus names to order a murder. They deposit hundreds of bitcoins into the website’s virtual wallet. The website’s administrator is duping them: no assassinations are ever carried out.
Every time a client contacted the administrator for an update on the hit they had ordered, the administrator would invent reasons the job had been delayed. It is a scam, and many people fell for it—one of many hitman scams on the dark web.

Another example involves a site called “The Green Machine.” One person, hoping to make some quick money, went to the Hidden Wiki to buy cloned cards. The offer looked so rewarding that he couldn’t help but get drawn in. When he visited The Green Machine, he saw hundreds of reviews, all glowing, about the Green Machine—or Mr. Fungi, as the operator was also known. The site was very cunning.

The page was listed first on the Hidden Wiki, and, as mentioned, all the reviews were positive. Everything seemed legitimate, though all of it was fake, and there was no way to tell. He emailed Mr. Fungi (the Green Machine) and got a reply very quickly. But the customer sensed something fishy and emailed Mr. Fungi again, this time posing as one of his fraud victims.

The reply stunned him: Mr. Fungi was quite proud of being a fraudster who pocketed his victims’ money and never provided the promised service in return.

Alpha Cards is another dark web scam operation. It has been involved in many financial scams and has duped users out of their money—though the users were hardly innocent, since they went to the dark web to commit financial fraud themselves. Alpha Cards operates in several segments, the primary one being credit cards.

The operators took payment from customers in bitcoin or Ethereum, promising cloned credit card details with PIN codes that could supposedly be used to purchase goods. After taking the payments, they delivered nothing.

We have discussed a few dark web scams. The dark web is dangerous, and you should avoid going there to commit illegal activities that cheat ordinary people—in return, you could get scammed and lose a lot of your own money. With this, we conclude our topic: the biggest dark web scams in the world.
The post Biggest Dark Web Scams of the world appeared first on Darkweb24.net.