
Hackaday Podcast Episode 348: 50 Grams of PLA Hold a Ton, Phreaknic Badge is Off The Shelf, and Hackers Need Repair Manuals

By: Tom Nardi
5 December 2025 at 12:00

Join Hackaday Editors Elliot Williams and Tom Nardi as they go over their picks for the best stories and hacks from the previous week. Things start off with a warning about the long-term viability of SSD backups, after which the discussion moves on to the limits of 3D printed PLA, the return of the Pebble smartwatch, some unconventional aircraft, and an online KiCad schematic repository that has plenty of potential. You’ll also hear about a remarkable conference badge made from e-waste electronic shelf labels, filling 3D prints with foam, and a tiny TV powered by the ESP32. The episode wraps up with our wish for hacker-friendly repair manuals, and an interesting tale of underwater engineering from D-Day.

Check out the links below if you want to follow along, and as always, tell us what you think about this episode in the comments!

As always, this episode is available in DRM-free MP3.

Where to Follow Hackaday Podcast

Episode 348 Show Notes:

News:

What’s that Sound?

  • Congratulations to [for_want_of_a_better_handle] for guessing the data center ambiance!

Interesting Hacks of the Week:

Quick Hacks:

Can’t-Miss Articles:


AWS CEO Matt Garman thought Amazon needed a million developers — until AI changed his mind

4 December 2025 at 18:56
AWS CEO Matt Garman, left, with Acquired hosts Ben Gilbert and David Rosenthal. (GeekWire Photo / Todd Bishop)

LAS VEGAS — Matt Garman remembers sitting in an Amazon leadership meeting six or seven years ago, thinking about the future, when he identified what he considered a looming crisis.

Garman, who has since become the Amazon Web Services CEO, calculated that the company would eventually need to hire a million developers to deliver on its product roadmap. The demand was so great that he considered the shortage of software development engineers (SDEs) the company’s biggest constraint.

With the rise of AI, he no longer thinks that’s the case.

Speaking with Acquired podcast hosts Ben Gilbert and David Rosenthal at the AWS re:Invent conference Thursday afternoon, Garman told the story in response to Gilbert’s closing question about what belief he held firmly in the past that he has since completely reversed.

“Before, we had way more ideas than we could possibly get to,” he said. Now, “because you can deliver things so fast, your constraint is going to be great ideas and great things that you want to go after. And I would never have guessed that 10 years ago.”

He was careful to point out that Amazon still needs great software engineers. But earlier in the conversation, he noted that massive technical projects that once required “dozens, if not hundreds” of people might now be delivered by teams of five or 10, thanks to AI and agents.

Garman was the closing speaker at the two-hour event with the hosts of the hit podcast, following conversations with Netflix Co-CEO Greg Peters, J.P. Morgan Payments Global Co-Head Max Neukirchen, and Perplexity Co-founder and CEO Aravind Srinivas.

A few more highlights from Garman’s comments:

Generative AI, including Bedrock, represents a multi-billion dollar business for Amazon. Asked to quantify how much of AWS is now AI-related, Garman said it’s getting harder to say, as AI becomes embedded in everything. 

Speaking off-the-cuff, he told the Acquired hosts that Bedrock is a multi-billion dollar business. Amazon clarified later that he was referring to the revenue run rate for generative AI overall. That includes Bedrock, which is Amazon’s managed service that offers access to AI models for building apps and services. [This has been updated since publication.]

How AWS thinks about its product strategy. Garman described a multi-layered approach to explain where AWS builds and where it leaves room for partners. At the bottom are core building blocks like compute and storage. AWS will always be there, he said.

In the middle are databases, analytics engines, and AI models, where AWS offers its own products and services alongside partners. At the top are millions of applications, where AWS builds selectively and only when it believes it has differentiated expertise.

Amazon is “particularly bad” at copying competitors. Garman was surprisingly blunt about what Amazon doesn’t do well. “One of the things that Amazon is particularly bad at is being a fast follower,” he said. “When we try to copy someone, we’re just bad at it.” 

The better formula, he said, is to think from first principles about solving a customer problem, not simply to copy existing products.

Hackaday Podcast Episode 347: Breaking Kindles, Baby’s First Synth, and Barcodes!

28 November 2025 at 12:00

This week, Hackaday’s Elliot Williams and Kristina Panos met up over coffee to bring you the latest news, mystery sound, and of course, a big bunch of hacks from the previous seven days or so.

On What’s That Sound, Kristina got sort of close, but of course failed spectacularly. Will you fare better and perhaps win a Hackaday Podcast t-shirt? Mayhap you will.

After that, it’s on to the hacks and such, beginning with an interesting tack to take with a flat-Earther that involves two gyroscopes.  And we take a look at the design requirements when it comes to building synths for three-year-olds.

Then we discuss several awesome hacks such as a vehicle retrofit to add physical heated seat controls, an assistive radio that speaks the frequencies, and an acoustic radiometer build. Finally, we look at the joys of hacking an old Kindle, and get a handle on disappearing door handles.

Check out the links below if you want to follow along, and as always, tell us what you think about this episode in the comments!

Download in DRM-free MP3 and savor at your leisure.

Where to Follow Hackaday Podcast

Episode 347 Show Notes:

News:

  • No news is good news! So we talk about Thanksgiving and what we’ve learned recently.

What’s that Sound?

Interesting Hacks of the Week:

Quick Hacks:

Can’t-Miss Articles:

Bezos is back in startup mode, Amazon gets weird again, and the great old-car tech retrofit debate

22 November 2025 at 11:27

This week on the GeekWire Podcast: Jeff Bezos is back in startup mode (sort of) with Project Prometheus — a $6.2 billion AI-for-the-physical-world venture that instantly became one of the most talked-about new companies in tech. We dig into what this really means, why the company’s location is still a mystery, and how this echoes the era when Bezos was regularly launching big bets from Seattle.

Then we look at Amazon’s latest real-world experiment: package-return kiosks popping up inside Goodwill stores around the Seattle region. It’s a small pilot, but it brings back memories of the early days when Amazon’s oddball experiments seemed to appear out of nowhere.

And finally…Todd tries to justify his scheme to upgrade his beloved 2007 Toyota Camry with CarPlay, Android Auto, and a backup camera — while John questions the logic of sinking thousands of dollars into an old car.

All that, plus a mystery Microsoft shirt, a little Seattle nostalgia, and a look ahead to next week’s podcast collaboration with Me, Myself and AI from MIT Sloan Management Review.

With GeekWire co-founders John Cook and Todd Bishop.

Subscribe to GeekWire in Apple Podcasts, Spotify, or wherever you listen.

Hackaday Podcast Episode 346: Melting Metal in the Microwave, Unlocking Car Brakes and Washing Machines, and a Series of Tubes

21 November 2025 at 12:00

Wait, what? Is it time for the podcast again? Seems like only yesterday that Dan joined Elliot for the weekly rundown of the choicest hacks for the last 1/52 of a year, but here we are. We had quite a bit of news to talk about, including the winners of the Component Abuse Challenge — warning, some components were actually abused for this challenge. They’re also a trillion pages deep over at the Internet Archive, a milestone that seems worth celebrating.

As for projects, both of us kicked things off with “Right to repair”-adjacent topics, first with a washing machine that gave up its secrets with IR and then with a car that refused to let its owner fix the brakes. We heated things up with a microwave foundry capable of melting cast iron — watch your toes! — and looked at a tiny ESP32 dev board with ludicrously small components. We saw surveyors go to war, watched a Lego sorting machine go through its paces, and learned about radar by spinning up a sonar set from first principles.

Finally, we wrapped things up with another Al Williams signature “Can’t Miss Articles” section, with his deep dive into the fun hackers can have with the now-deprecated US penny, and his nostalgic look at pneumatic tube systems.

Download this 100% GMO-free MP3.

Where to Follow Hackaday Podcast

Episode 346 Show Notes:

News:

What’s that Sound?

  • [Andy Geppert] knew that was the annoying sound of the elevator at the Courtyard by Marriott hotel in Pasadena.

Interesting Hacks of the Week:

Quick Hacks:

Can’t-Miss Articles:

Generative AI in the Real World: The LLMOps Shift with Abi Aryan

20 November 2025 at 07:16

MLOps is dead. Well, not really, but for many the job is evolving into LLMOps. In this episode, Abide AI founder and LLMOps author Abi Aryan joins Ben to discuss what LLMOps is and why it’s needed, particularly for agentic AI systems. Listen in to hear why LLMOps requires a new way of thinking about observability, why we should spend more time understanding human workflows before mimicking them with agents, how to do FinOps in the age of generative AI, and more.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: All right, so today we have Abi Aryan. She is the author of the O’Reilly book on LLMOps as well as the founder of Abide AI. So, Abi, welcome to the podcast. 

00.19: Thank you so much, Ben. 

00.21: All right. Let’s start with the book, which I confess, I just cracked open: LLMOps. People probably listening to this have heard of MLOps. So at a high level, the models have changed: They’re bigger, they’re generative, and so on and so forth. So since you’ve written this book, have you seen a wider acceptance of the need for LLMOps? 

00.51: I think more recently there are more infrastructure companies. So there was a conference happening recently, and there was this sort of perception or messaging across the conference, which was “MLOps is dead.” Although I don’t agree with that. 

There’s a big difference that companies have started to pick up on more recently, as the infrastructure around the space has sort of started to improve. They’re starting to realize how different the pipelines were that people managed and grew, especially for the older companies like Snorkel that were in this space for years and years before large language models came in. The way they were handling data pipelines—and even the observability platforms that we’re seeing today—have changed tremendously.

01.40: What about, Abi, the general. . .? We don’t have to go into specific tools, but we can if you want. But, you know, if you look at the old MLOps person and then fast-forward, this person is now an LLMOps person. So on a day-to-day basis [has] their suite of tools changed? 

02.01: Massively. I think for an MLOps person, the focus was very much around “This is my model. How do I containerize my model, and how do I put it in production?” That was the entire problem and, you know, most of the work was around “Can I containerize it? What are the best practices around how I arrange my repository? Are we using templates?” 

Drawbacks happened, but not as much, because most of the time the stuff was tested and there was not much nondeterministic behavior within the models themselves. Now that has changed.

02.38: [For] most of the LLMOps engineers, the biggest job right now is doing FinOps really, which is controlling the cost because the models are massive. The second thing, which has been a big difference, is we have shifted from “How can we build systems?” to “How can we build systems that can perform, and not just perform technically but perform behaviorally as well?”: “What is the cost of the model? But also what is the latency? And see what’s the throughput looking like? How are we managing the memory across different tasks?” 

The problem has really shifted when we talk about it. . . So a lot of focus for MLOps was “Let’s create fantastic dashboards that can do everything.” Right now it’s no matter which dashboard you create, the monitoring is really very dynamic. 

03.32: Yeah, yeah. As you were talking there, you know, I started thinking, yeah, of course, obviously now the inference is essentially a distributed computing problem, right? So that was not the case before. Now you have different phases even of the computation during inference, so you have the prefill phase and the decode phase. And then you might need different setups for those. 

So anecdotally, Abi, did the people who were MLOps people successfully migrate themselves? Were they able to upskill themselves to become LLMOps engineers?

04.14: I know a couple of friends who were MLOps engineers. They were teaching MLOps as well—Databricks folks, MVPs. And they were now transitioning to LLMOps.

But the way they started is they started focusing very much on, “Can you do evals for these models?” They weren’t really dealing with the infrastructure side of it yet. And that was their slow transition. And right now they’re very much at that point where they’re thinking, “OK, can we make it easy to just catch these problems within the model—inferencing itself?”

04.49: A lot of other problems still stay unsolved. Then the other side, which was like a lot of software engineers who entered the field and became AI engineers, they have a much easier transition because software. . . The way I look at large language models is not just as another machine learning model but literally like software 3.0 in that way, which is it’s an end-to-end system that will run independently.

Now, the model isn’t just something you plug in. The model is the product tree. So for those people, most software is built around these ideas, which is, you know, we need a strong cohesion. We need low coupling. We need to think about “How are we doing microservices, how the communication happens between different tools that we’re using, how are we calling up our endpoints, how are we securing our endpoints?”

Those questions come easier. So the system design side of things comes easier to people who work in traditional software engineering. So the transition has been a little bit easier for them as compared to people who were traditionally like MLOps engineers. 

05.59: And hopefully your book will help some of these MLOps people upskill themselves into this new world.

Let’s pivot quickly to agents. Obviously it’s a buzzword. Just like anything in the space, it means different things to different teams. So how do you distinguish agentic systems yourself?

06.24: There are two words in the space. One is agents; one is agent workflows. Basically agents are the components really. Or you can call them the model itself, but they’re trying to figure out what you meant, even if you forgot to tell them. That’s the core work of an agent. And the work of a workflow or the workflow of an agentic system, if you want to call it, is to tell these agents what to actually do. So one is responsible for execution; the other is responsible for the planning side of things. 

07.02: I think sometimes when tech journalists write about these things, the general public gets the notion that there’s this monolithic model that does everything. But the reality is, most teams are moving away from that design as you, as you describe.

So they have an agent that acts as an orchestrator or planner and then parcels out the different steps or tasks needed, and then maybe reassembles in the end, right?

07.42: Coming back to your point, it’s now less of a problem of machine learning. It’s, again, more like a distributed systems problem because we have multiple agents. Some of these agents will have more load—they will be the frontend agents, which are communicating to a lot of people. Obviously, on the GPUs, these need more distribution.

08.02: And when it comes to the other agents that may not be used as much, they can be provisioned based on “This is the need, and this is the availability that we have.” So all of that provisioning again is a problem. The communication is a problem. Setting up tests across different tasks itself within an entire workflow, now that becomes a problem, which is where a lot of people are trying to implement context engineering. But it’s a very complicated problem to solve. 
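
To make the planner/worker split from this exchange concrete, here is a minimal, framework-free sketch. The call_llm, planner, and run_workflow names are hypothetical stand-ins for illustration, not any particular agent framework’s API.

```python
# Hypothetical sketch of a planner/worker agent workflow: one agent plans,
# workers execute the steps, and the orchestrator reassembles the results.
from typing import Callable

def call_llm(role: str, task: str) -> str:
    # Stand-in for a real model call; returns a canned string for illustration.
    return f"[{role}] result for: {task}"

def planner(goal: str) -> list[str]:
    # A real planner would ask a model to decompose the goal; here we fake it.
    return [f"{goal} -- step {i}" for i in range(1, 4)]

def run_workflow(goal: str, worker: Callable[[str, str], str] = call_llm) -> str:
    steps = planner(goal)                                  # planning: decide what to do
    results = [worker("worker", step) for step in steps]   # execution: do each step
    return call_llm("orchestrator", "assemble: " + "; ".join(results))  # reassembly

print(run_workflow("summarize the quarterly report"))
```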

08.31: And then, Abi, there’s also the problem of compounding reliability. Let’s say, for example, you have an agentic workflow where one agent passes off to another agent, and then to a third agent. Each agent may be fairly reliable on its own, but the errors compound across the pipeline, which makes it more challenging.
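
For a back-of-the-envelope sense of that compounding, with made-up per-agent numbers:

```python
# If each of three chained agents succeeds 95% of the time (an assumed figure),
# the end-to-end pipeline succeeds only about 86% of the time.
per_agent_success = 0.95
pipeline_success = per_agent_success ** 3
print(f"{pipeline_success:.3f}")  # 0.857
```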

09.02: And that’s where there’s a lot of research work going on in the space. It’s an idea that I’ve talked about in the book as well. At that point when I was writing the book, especially chapter four, in which a lot of these were described, most of the companies right now are [using] monolithic architecture, but it’s not going to be able to sustain as we go towards application.

We have to go towards a microservices architecture. And the moment we go towards microservices architecture, there are a lot of problems. One will be the hardware problem. The other is consensus building, which is. . . 

Let’s say you have three different agents spread across three different nodes, which would be running very differently. Let’s say one is running on an H100; one is running on something else. How can we achieve consensus if even one of the nodes ends up winning? So that’s open research work [where] people are trying to figure out, “Can we achieve consensus in agents based on whatever answer the majority is giving, or how do we really think about it?” It should be set up at a threshold at which, if it’s beyond this threshold, then you know, this perfectly works.

One of the frameworks that is trying to work in this space is called MassGen—they’re working on the research side of solving this problem itself in terms of the tool itself. 

10.31: By the way, even back in the microservices days in software architecture, obviously people went overboard too. So I think that, as with any of these new things, there’s a bit of trial and error that you have to go through. And the better you can test your systems and have a setup where you can reproduce and try different things, the better off you are, because many times your first stab at designing your system may not be the right one. Right? 

11.08: Yeah. And I’ll give you two examples of this. So AI companies tried to use a lot of agentic frameworks. You know people have used Crew; people have used n8n, they’ve used. . . 

11.25: Oh, I hate those! Not I hate. . . Sorry. Sorry, my friends at Crew.

11.30: And 90% of the people working in this space seriously have already made that transition, which is “We are going to write it ourselves.”

The same happened for evaluation: There were a lot of evaluation tools out there. What they were doing on the surface is literally just tracing, and tracing wasn’t really solving the problem—it was just a beautiful dashboard that doesn’t really serve much purpose. Maybe for the business teams. But at least for the ML engineers who are supposed to debug these problems and, you know, optimize these systems, essentially, it was not giving much other than “What is the error response that we’re getting to everything?”

12.08: So again, for that one as well, most of the companies have developed their own evaluation frameworks in-house, as of now. The people who are just starting out, obviously they’ve done. But most of the companies that started working with large language models in 2023, they’ve tried every tool out there in 2023, 2024. And right now more and more people are staying away from the frameworks and launching and everything.

People have understood that most of the frameworks in this space are not superreliable.

12.41: And [are] also, honestly, a bit bloated. They come with too many things that you don’t need in many ways. . .

12:54: Security loopholes as well. So for example, like I reported one of the security loopholes with LangChain as well, with LangSmith back in 2024. So those things obviously get reported by people [and] get worked on, but the companies aren’t really proactively working on closing those security loopholes. 

13.15: Two open source projects that I like that are not specifically agentic are DSPy and BAML. Wanted to give them a shout out. So this point I’m about to make, there’s no easy, clear-cut answer. But one thing I noticed, Abi, is that people will do the following, right? I’m going to take something we do, and I’m going to build agents to do the same thing. But the way we do things is I have a—I’m just making this up—I have a project manager and then I have a designer, I have role B, role C, and then there’s certain emails being exchanged.

So then the first step is “Let’s replicate not just the roles but kind of the exchange and communication.” And sometimes that actually increases the complexity of the design of your system because maybe you don’t need to do it the way the humans do it. Right? Maybe if you go to automation and agents, you don’t have to over-anthropomorphize your workflow. Right. So what do you think about this observation? 

14.31: A very interesting analogy I’ll give you is people are trying to replicate intelligence without understanding what intelligence is. The same for consciousness. Everybody wants to replicate and create consciousness without understanding consciousness. So the same is happening with this as well, which is we are trying to replicate a human workflow without really understanding how humans work.

14.55: And sometimes humans may not be the most efficient thing. Like they exchange five emails to arrive at something. 

15.04: And humans are never context defined. And in a very limiting sense. Even if somebody’s job is to do editing, they’re not just doing editing. They are looking at the flow. They are looking for a lot of things which you can’t really define. Obviously you can over a period of time, but it needs a lot of observation to understand. And that skill also depends on who the person is. Different people have different skills as well. Most of the agentic systems right now, they’re just glorified Zapier IFTTT routines. That’s the way I look at them right now. The if recipes: If this, then that.

15.48: Yeah, yeah. Robotic process automation, I guess, is what people call it. The other thing that I don’t think people understand just from reading the popular tech press is that agents have levels of autonomy, right? Most teams don’t actually build an agent and unleash it fully autonomous from day one.

I mean, I guess the analogy would be in self-driving cars: They have different levels of automation. Most enterprise AI teams realize that with agents, you have to kind of treat them that way too, depending on the complexity and the importance of the workflow. 

So you go first very much a human is involved and then less and less human over time as you develop confidence in the agent.

But I think it’s not good practice to just kind of let an agent run wild. Especially right now. 

16.56: It’s not, because who’s the person answering if the agent goes wrong? And that’s a question that has come up often. So this is the work that we’re doing at Abide really, which is trying to create a decision layer on top of the knowledge retrieval layer.

17.07: Most of the agents which are built using just large language models. . . LLMs—I think people need to understand this part—are fantastic at knowledge retrieval, but they do not know how to make decisions. If you think agents are independent decision makers and they can figure things out, no, they cannot figure things out. They can look at the database and try to do something.

Now, what they do may or may not be what you like, no matter how many rules you define across that. So what we really need to develop is some sort of symbolic language around how these agents are working, which is more like trying to give them a model of the world around “What is the cause and effect, with all of these decisions that you’re making? How do we prioritize one decision where the. . .? What was the reasoning behind that so that entire decision making reasoning here has been the missing part?”

18.02: You brought up the topic of observability. There’s two schools of thought here as far as agentic observability. The first one is we don’t need new tools. We have the tools. We just have to apply [them] to agents. And then the second, of course, is this is a new situation. So now we need to be able to do more. . . The observability tools have to be more capable because we’re dealing with nondeterministic systems.

And so maybe we need to capture more information along the way. Chains of decision, reasoning, traceability, and so on and so forth. Where do you fall in this kind of spectrum of we don’t need new tools or we need new tools? 

18.48: We don’t need new tools, but we certainly need new frameworks, and especially a new way of thinking. Observability in the MLOps world—fantastic; it was just about tools. Now, people have to stop thinking about observability as just visibility into the system and start thinking of it as an anomaly detection problem. And that was something I’d written in the book as well. Now it’s no longer about “Can I see what my token length is?” No, that’s not enough. You have to look for anomalies at every single part of the layer across a lot of metrics. 

19.24: So your position is we can use the existing tools. We may have to log more things. 

19.33: We may have to log more things, and then start building simple ML models to be able to do anomaly detection. 

Think of managing any machine, any LLM model, any agent as really like a fraud detection pipeline. So every single time you’re looking for “What are the simplest signs of fraud?” And that can happen across various factors. But we need more logging. And again you don’t need external tools for that. You can set up your own loggers as well.

Most of the people I know have been setting up their own loggers within their companies. So you can simply use telemetry to be able to a.) define a set and use the general logs, and b.) be able to define your own custom logs as well, depending on your agent pipeline itself. You can define “This is what it’s trying to do” and log more things across those things, and then start building small machine learning models to look for what’s going on over there.
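
As a rough, stdlib-only sketch of that log-then-detect approach: the field names, JSON log format, and z-score threshold below are illustrative assumptions, not part of any particular observability stack.

```python
# Sketch: emit structured per-call logs yourself, then run a simple statistical
# anomaly check over them instead of relying on an external dashboard.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent_telemetry")

def log_call(agent: str, prompt_tokens: int, completion_tokens: int, latency_s: float) -> dict:
    record = {
        "ts": time.time(),
        "agent": agent,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_s": latency_s,
    }
    logger.info(json.dumps(record))  # structured log line, easy to parse later
    return record

def flag_anomalies(history: list[dict], field: str = "latency_s", z: float = 3.0) -> list[dict]:
    # Flag records more than z standard deviations from the mean -- a crude
    # stand-in for the "small ML model" doing anomaly detection.
    values = [r[field] for r in history]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1e-9
    return [r for r in history if abs(r[field] - mean) / std > z]
```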

20.36: So what is the state of “Where we are? How many teams are doing this?” 

20.42: Very few. Very, very few. Maybe just the top bits. The ones who are doing reinforcement learning training and using RL environments, because that’s where they’re getting their data to do RL. But people who are not using RL to be able to retrain their model, they’re not really doing much of this part; they’re still depending very much on external accounts.

21.12: I’ll get back to RL in a second. But one topic you raised when you pointed out the transition from MLOps to LLMOps was the importance of FinOps, which is, for our listeners, basically managing your cloud computing costs—or in this case, increasingly mastering token economics. Because basically, it’s one of these things that I think can bite you.

For example, the first time you use Claude Code, you go, “Oh, man, this tool is powerful.” And then boom, you get an email with a bill. I see, that’s why it’s powerful. And you multiply that across the board to teams who are starting to maybe deploy some of these things. And you see the importance of FinOps.

So where are we, Abi, as far as tooling for FinOps in the age of generative AI and also the practice of FinOps in the age of generative AI? 

22.19: Less than 5%, maybe even 2% of the way there. 

22:24: Really? But obviously everyone’s aware of it, right? Because at some point, when you deploy, you become aware. 

22.33: Not enough people. A lot of people just think about FinOps as cloud, basically the cloud cost. And there are different kinds of costs in the cloud. One of the things people are not doing enough of is profiling their models properly, which is [determining] “Where are the costs really coming from? Our models’ compute power? Are they taking too much RAM?”

22.58: Or are we using reasoning when we don’t need it?

23.00: Exactly. Now that’s a problem we solve very differently. That’s where yes, you can do kernel fusion. Define your own custom kernels. Right now there’s a massive number of people who think we need to rewrite kernels for everything. It’s only going to solve one problem, which is the compute-bound problem. But it’s not going to solve the memory-bound problem. Your data engineering pipelines are what’s going to solve your memory-bound problems.

And that’s where most of the focus is missing. I’ve mentioned it in the book as well: Data engineering is the foundation of first being able to solve the problems. And then we moved to the compute-bound problems. Do not start optimizing the kernels over there. And then the third part would be the communication-bound problem, which is “How do we make these GPUs talk smarter with each other? How do we figure out the agent consensus and all of those problems?”

Now that’s a communication problem. And that’s what happens when there are different levels of bandwidth. Everybody’s dealing with the internet bandwidth as well, the kind of serving speed as well, different kinds of cost and every kind of transitioning from one node to another. If we’re not really hosting our own infrastructure, then that’s a different problem, because it depends on “Which server do you get assigned your GPUs on again?”
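
One concrete way to start the profiling being described here is PyTorch’s built-in profiler, which breaks out per-operator time and memory so you can see whether a workload is compute-bound or memory-bound. The tiny model below is just a placeholder workload, not anything from the conversation.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder workload standing in for a real model.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10))
x = torch.randn(64, 512)

with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with record_function("inference"):
        model(x)

# Sorting by memory vs. time shows where the cost actually comes from.
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
```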

24.20: Yeah, yeah, yeah. I want to give a shout out to Ray—I’m an advisor to Anyscale—because Ray basically is built for these sorts of pipelines because it can do fine-grained utilization and help you decide between CPU and GPU. And just generally, you don’t think that the teams are taking token economics seriously?

I guess not. How many people have I heard talking about caching, for example? Because if it’s a prompt that [has been] answered before, why do you have to go through it again? 

25.07: I think plenty of people have started implementing KV caching, but they don’t really know. . . Again, one of the questions people don’t understand is “How much do we need to store in the memory itself, and how much do we need to store in the cache?” which is the big memory question. So that’s the one I don’t think people are able to solve. A lot of people are storing too much stuff in the cache that should actually be stored in the RAM itself, in the memory.

And there are generalist applications that don’t really understand that this agent doesn’t really need access to the memory. There’s no point. It’s just lost in the throughput really. So I think the problem isn’t really caching. The problem is that differentiation of understanding for people. 

25.55: Yeah, yeah, I just threw that out as one element. Because obviously there’s many, many things to mastering token economics. So you, you brought up reinforcement learning. A few years ago, obviously people got really into “Let’s do fine-tuning.” But then they quickly realized. . . And actually fine-tuning became easy because basically there became so many services where you can just focus on labeled data. You upload your labeled data, boom, come back from lunch, you have a fine-tuned model.

But then people realize that “I fine-tuned, but the model that results isn’t really as good as my fine-tuning data.” And then obviously RAG and context engineering came into the picture. Now it seems like more people are again talking about reinforcement learning, but in the context of LLMs. And there’s a lot of libraries, many of them built on Ray, for example. But it seems like what’s missing, Abi, is that fine-tuning got to the point where I can sit down a domain expert and say, “Produce labeled data.” And basically the domain expert is a first-class participant in fine-tuning.

As best I can tell, for reinforcement learning, the tools aren’t there yet. The UX hasn’t been figured out in order to bring in the domain experts as the first-class citizen in the reinforcement learning process—which they need to be because a lot of the stuff really resides in their brain. 

27.45: The big problem here, and very, very much to the point of what you pointed out, is the tools aren’t really there. And one very specific thing I can tell you is most of the reinforcement learning environments that you’re seeing are static environments. Agents are not learning statically. They are learning dynamically. If your RL environment cannot adapt dynamically, it won’t keep up; most of this tooling basically emerged around 2018, 2019, with OpenAI Gym, when a lot of reinforcement learning libraries were coming out.

28.18: There is a line of work called curriculum learning, which is basically adapting the training difficulty to the model’s results. So that can be used in reinforcement learning, but I’ve not seen any practical implementation of using curriculum learning for reinforcement learning environments. So people create these environments—fantastic. They work well for a little bit of time, and then they become useless.

So that’s where even OpenAI, Anthropic, those companies are struggling as well. They’ve paid heavily in contracts, which are yearlong contracts to say, “Can you build this vertical environment? Can you build that vertical environment?” and that works fantastically. But once the model learns on it, then there’s nothing else to learn. And then you go back into the question of, “Is this data fresh? Is this adaptive with the world?” And it becomes the same RAG problem over again.

29.18: So maybe the problem is with RL itself. Maybe we need a different paradigm. It’s just too hard.

Let me close by looking to the future. The first thing is—the space is moving so fast, this might be an impossible question to ask, but if you look at, let’s say, 6 to 18 months, what are some things in the research domain that you think are not being talked about enough that might produce enough practical utility that we will start hearing about them in 6 to 12, 6 to 18 months?

29.55: One is how to profile your machine learning models, like the entire systems end-to-end. A lot of people do not understand them as systems, but only as models. So that’s one thing which will make a massive amount of difference. There are a lot of AI engineers today, but we don’t have enough system design engineers.

30.16: This is something that Ion Stoica at Sky Computing Lab has been giving keynotes about. Yeah. Interesting. 

30.23: The second part is. . . I’m optimistic about seeing curriculum learning applied to reinforcement learning as well, where our RL environments can adapt in real time so when we train agents on them, they are dynamically adapting as well. That’s also [some] of the work being done by labs like Circana, which are working in artificial labs, artificial light frame, all of that stuff—evolution of any kind of machine learning model accuracy. 

30.57: The third thing where I feel like the communities are falling behind massively is on the data engineering side. That’s where we have massive gains to get. 

31.09: So on the data engineering side, I’m happy to say that I advise several companies in the space that are completely focused on tools for these new workloads and these new data types. 

Last question for our listeners: What mindset shift or what skill do they need to pick up in order to position themselves in their career for the next 18 to 24 months?

31.40: For anybody who’s an AI engineer, a machine learning engineer, an LLMOps engineer, or an MLOps engineer, first learn how to profile your models. Start picking up Ray very quickly as a tool to just get started on, to see how distributed systems work. You can pick the LLM if you want, but start understanding distributed systems first. And once you start understanding those systems, then start looking back into the models itself. 
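
For readers who want to act on that advice, here is a minimal Ray sketch (assuming ray is installed); the squaring task is just a stand-in for any per-shard work such as inference or evaluation.

```python
import ray

ray.init()  # starts a local Ray cluster for this sketch

@ray.remote
def score_chunk(chunk_id: int) -> int:
    # Stand-in for real per-shard work (inference, evals, profiling runs, ...).
    return chunk_id * chunk_id

futures = [score_chunk.remote(i) for i in range(8)]  # dispatched to workers in parallel
print(ray.get(futures))  # gather results: [0, 1, 4, 9, 16, 25, 36, 49]
ray.shutdown()
```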

32.11: And with that, thank you, Abi.


Why January Ventures is funding underrepresented AI founders

19 November 2025 at 13:15
While everyone’s chasing the next AI infrastructure play in San Francisco, some of the most defensible AI companies are being built by founders with deep expertise in legacy industries — and they’re not getting funded. January Ventures aims to fill that gap, writing pre-seed checks for underrepresented founders transforming healthcare, manufacturing, and supply chain with […]

Generative AI in the Real World: Laurence Moroney on AI at the Edge

13 November 2025 at 08:59

In this episode, Laurence Moroney, director of AI at Arm, joins Ben Lorica to chat about the state of deep learning frameworks—and why you may be better off thinking a step higher, on the solution level. Listen in for Laurence’s thoughts about posttraining; the evolution of on-device AI (and how tools like ExecuTorch and LiteRT are helping make it possible); why culturally specific models will only grow in importance; what Hollywood can teach us about LLM privacy; and more.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: All right. So today we have Laurence Moroney, director of AI at Arm and author of the book AI and ML for Coders in PyTorch. Laurence is someone I’ve known for a while. He was at Google serving as one of the main evangelists for TensorFlow. So welcome to the podcast, Laurence. 

00.23: Thank you Ben. It’s great to be here.

00.26: I guess, before we go on to the present, let’s talk about a little bit of the past of deep learning frameworks. In fact, this week is interesting because Soumith Chintala just announced he was leaving Meta, and Soumith was one of the leaders of the PyTorch project. I interviewed Soumith in an O’Reilly podcast after PyTorch was released, but coincidentally, exactly about a year before I interviewed Rajat Monga right around the time that TensorFlow was released. I was actually talking to these project leaders very early on. 

So, Laurence, you moved your book to PyTorch, and I’m sure TensorFlow still holds a special place in your heart, right? So where does TensorFlow sit right now in your mind? Because right now it’s all about PyTorch, right?

01.25: Yeah, that’s a great question. TensorFlow definitely has a very special place in my heart. I built a lot of my recent career on TensorFlow. I’ll be frank. It feels like there’s not that much investment in TensorFlow anymore.

If you take a look at even releases, it went 2.8, 2.9, 2.10, 2.11. . .and you know, there’s no 3.0 on the horizon. I can’t really share any insider stuff from Google, although I left there over a year ago, but it does feel that unfortunately [TensorFlow has] kind of withered on the vine a little bit internally at Google compared to JAX.

02.04: But then the problem, at least for me from an external perspective, is, first of all, JAX isn’t really a machine learning framework. There are machine learning frameworks that are built on top of it. And second of all, it’s not a 1.0 product. It’s hard for me to encourage anybody to bet their business or their career on something that isn’t at least a 1.0 product.

02.29: That really just leaves (by default) PyTorch. Obviously there’s been all of the momentum around PyTorch. There’s been all of the excitement around it. It’s interesting, though, that if you look at things like GitHub star history, it still lags behind both TensorFlow and JAX. But in perception it is the most popular. And unfortunately, if you do want to build a career now on creating machine learning models, not just using machine learning models, it’s really the—oh well, I shouldn’t say unfortunately. . . The truth is that it’s really the only option. So that’s the negative side. 

The positive side of it is of course, it’s really, really good. I’ve been using it extensively for some time. Even during my TensorFlow and JAX days, I did use PyTorch a lot. I wanted to keep an eye on how it was used, how it’s shaped, what worked, what didn’t, the best way for somebody to learn how to learn using PyTorch—and to make sure that the TensorFlow community, as I was working on it, were able to keep up with the simplicity of PyTorch, particularly the brilliant work that was done by the Keras team to really make Keras part of TensorFlow. It’s now been kind of pulled aside, pulled out of TensorFlow somewhat, but that was something that leaned into the same simplicity as PyTorch.

03.52: And like I said, now going forward, PyTorch is. . . I rewrote my book to be PyTorch specific. Andrew and I are teaching a PyTorch specialization with DeepLearning.AI on Coursera. And you know, my emphasis is less on frameworks and framework wars and loyalties and stuff like that and more on really wanting to help people succeed, to build careers or to build startups, that kind of thing, and this was the direction I think it should go in.

04.19: Now, maybe I’m wrong, but I think even about two years ago, maybe a little more than that, I was still hearing and seeing job posts around TensorFlow, primarily around people working in computer vision on edge devices. So is that still a place where you would run into TensorFlow users?

04.41: Absolutely, yes. Because of what was previously called TensorFlow Lite and is now called LiteRT as a runtime for models to be able to run on edge devices. I mean, that really was the only option until recently— just last week at the PyTorch Summit, ExecuTorch went 1.0. And if I go back to my old mantra of “I really don’t want anybody to invest their business or their career on something that’s prerelease,” it’s good to learn and it’s good to prepare.

05.10: [Back] then, the only option for you to be able to train models and deploy them, particularly to mobile devices, was effectively either LiteRT or TensorFlow Lite or whatever it’s called now, or Core ML for Apple devices. But now with ExecuTorch going 1.0, the whole market is out there for PyTorch developers to be able to deploy to mobile and edge devices.

05.34: So those job listings, I think as they evolve and go forward, the skills may kind of veer more towards PyTorch, but I’d also encourage everybody to kind of double-click above the framework level and start thinking on the solution level. There’ve been a lot of framework wars in so many things, you know, Mac versus PC, .NET versus Java. And in some ways, that’s not the most productive way of thinking about things.

I think the best thing to do is [to] think about what’s out there to allow you to build a solution that you can deploy, that you can trust, and that will be there for some time. And let the framework be secondary to that. 

06.14: All right. So one last framework question. And this is also an observation that might be slightly dated—I think this might be from around two years ago. I was actually surprised that, for some reason, I think the Chinese government is also encouraging Chinese companies to use local deep learning frameworks. So it’s not just PaddlePaddle. There’s another one that I came across and I don’t know what’s the status of that now, as far as you know. . .

06.43: So I’m not familiar with any others other than PaddlePaddle. But I do generally agree with [the idea that] cultures should be thinking about using tools and frameworks and models that are appropriate for their culture. I’m going to pivot away from frameworks towards large language models as an example. 

Large language models are primarily built on English. And when you start peeling apart large language models and look at what’s underneath the hood and particularly how they tokenize words, it’s very, very English oriented. So if you start wanting to build solutions, for example, for things like education—you know, important things!—and you’re not primarily an English language-speaking country, you’re already a little bit behind the curve.

07.35: Actually, I just came from a meeting with some folks from Ireland. And for the Gaelic language, the whole idea of posttraining models that were trained primarily with English tokens is already setting you apart at a disadvantage if you’re trying to build stuff that you can use within your culture.

At the very least, missing tokens, right? There were subwords in Gaelic that don’t exist in English, or subwords in Japanese or Chinese or Korean or whatever that don’t exist in English. So if you start even trying to do posttraining, you realize that the model was trained on using tokens that are. . . You need to use tokens that the model wasn’t trained with and stuff like that.

So I know I’m not really answering the framework part of it, but I do think it’s an important thing, like you mentioned, that China wants to invest in their own frameworks. But I think every culture should also be looking at. . . Cultural preservation is very, very important in the age of AI, as we build more dependence on AI. 

08.37: When it comes to a framework, PyTorch is open source. TensorFlow is open source. I’m pretty sure PaddlePaddle is open source. I don’t know. I’m not really that familiar with it. So you don’t have the traps of being locked into somebody else’s cultural perspective or language or anything like that, that you would have with an obscure large language model if you’re using an open source framework. So that part isn’t as difficult when it comes to, like, a country wanting to adopt a framework. But certainly when it comes to building on top of pretrained models, that’s where you need to be careful.

09.11: So [for] most developers and most enterprise AI teams, the reality is they’re not going to be pretraining. So it’s mostly about posttraining, which is a big topic. It can run the gamut of RAG, fine-tuning, reinforcement learning, distillation, quantization. . . So from that perspective, Laurence, how much should someone who’s in an enterprise AI team really know about these deep learning frameworks?

09.42: So I think two different things there, right? One is posttraining and one is deep learning frameworks. I’m going to lean into the posttraining side to argue that that’s the single number one important skill for developers going forward: posttraining and all of their types of code.

10.00: And all of the types of posttraining.

10.01: Yeah, totally. There’s always trade-offs, right? There’s the very simple posttraining stuff like RAG, which is relatively low value, and then there’s the more complex stuff like a full retrain or a LoRA-type training, which is more expensive or more difficult but has higher value. 

But I think there’s a whole spectrum of ways of doing things with posttraining. And my argument that I’m making very passionately is that if you’re a developer, that is the number one skill to learn going forward. “Agents” was kind of the buzzword of 2025; I think “small AI” will be the buzzword of 2026. 

10.40: We often talk about open source AI with open source models and stuff like that. It’s not really open source. It’s a bit of a misnomer. The weights have been released for you to be able to use and self-host them—if you want a self-hosted chatbot or self-host something that you want to run on them. 

But more importantly, the weights are there for you to change, through retraining, through fine-tuning and stuff like that. I’m particularly passionate about that because when you start thinking in terms of two things—latency and privacy—it becomes really, really important. 

11.15: I spent a lot of time working with folks who are passionate about IP. I’ll share one of them: Hollywood movie studios. And we’ve probably all seen those semi-frivolous lawsuits of, person A makes a movie, and then person B sues person A because person B had the idea first. And movie studios are generally terrified of that kind of thing. 

I actually have a movie in preproduction with a studio at the moment. So I’ve learned a lot through that. And one of the things [I learned] was, even when I speak with producers or the financiers, a lot of time we talk on the phone. We don’t email or anything like that because the whole fear of IP leaks is out there, and this has led to a fear there of, think of all the things that an LLM could be used to [do]. The shallow stuff would be to help you write scenes and all that kind of stuff. But most of them don’t really care about that. 

The more important things where an LLM could be used [are it could] evaluate a script and count the number of locations that would be needed to film this script. Like the Mission:Impossible script, where one scene’s in Paris and another scene’s in Moscow, and another scene is in Hong Kong. To be able to have a machine that can evaluate that and help you start budgeting. Or if somebody sends in a speculative script with all of that kind of stuff in it, and you realize you don’t have half a billion to make this movie from an unknown, because they have all these locations.

12.41: So all of this kind of analysis that can be done—story analysis, costing analysis, and all of that type of stuff—is really important to them. And it’s great low-hanging fruit for something like an LLM to do. But there’s no way they’re going to upload their speculative scripts to Gemini or OpenAI or Claude or anything like that.

So local AI is really important to them—and the whole privacy part of it. You run the model and the machine; you do the analysis on the machine; the data never leaves your laptop. And then extend that. I mean, not everybody’s going to be working with Hollywood studios, but extend that to just general small offices—your law office, your medical office, your physiotherapists, or whatever [where] everybody is using large language models for very creative things, but if you can make those models far more effective at your specific domain. . .

13.37: I’ll use a small office, for example, in a particular state in a particular jurisdiction, to be able to retrain a model, to be an expert in the law for that jurisdiction based on prior, what is it they call it? Jury priors? I can’t remember the Latin phrase for it, but, you know, based on precedents. To be able to fine-tune a model for that and then have everything locally within your office so you’re not sharing out to Claude or Gemini or OpenAI or whatever. Developers are going to be building that stuff. 

14.11: And with a lot of fear, uncertainty and doubt out there for developers with code generation, the optimist in me is seeing that [for] developers, your value bar is actually raising up. If your value is just your ability to churn out code, now models can compete with you. But if you’re raising the value of yourself to being able to do things that are much higher value than just churning out code—and I think fine-tuning is a part of that—then that actually leads to a very bright future for developers.

14.43: So here’s my impression of the state of tooling for posttraining. So [with] RAG and different variants of RAG, it seems like people have enough tools or have tools or have some notion of how to get started. [For] fine-tuning, there’s a lot of services that you can use now, and it mainly comes down to collecting a fine-tuning dataset it seems like.

[For] reinforcement learning, we still need tools that are accessible. The workflow needs to be at a point where a domain expert can actually do it—and that’s in some ways kind of where we are in fine-tuning, so the domain expert can focus on the dataset. Reinforcement learning, not so much the case. 

I don’t know, Laurence, if you would consider quantization and distillation part of posttraining, but it seems like that might also be something where people would also need more tools. More options. So what’s your sense of tooling for the different types of posttraining?

15.56: Good question. I’ll start with RAG because it’s the easiest. There’s obviously lots of tooling out there for it. 

16.04: And startups, right? So a lot of startups. 

16.07: Yep. I think the thing with RAG that interests me and fascinates me the most is in some ways it shares [similarities] with the early days of actually doing machine learning with the likes of Keras or PyTorch or TensorFlow, where there’s a lot of trial and error. And, you know, the tools.

16.25: Yeah, there are a lot of knobs that you can optimize. People underestimate how important that is, right?

16.35: Oh, absolutely. Even the most basic knob, like, How big a slice do you take of your text, and how big of an overlap do you do between those slices? Because you can have vastly different results by doing that. 

16.51: So just as a quick recap from if anybody’s not familiar with RAG, I’d like to give one little example of it. I actually wrote a novel about 12, 13 years ago, and six months after the novel was published, the publisher went bust. And this novel is not in the training set of any LLM.

So if I go to an LLM like Claude or GPT or anything like that and I ask about the novel, it will usually either say it doesn’t know or it will hallucinate and it’ll make stuff up and say it knows it. So to me, this was the perfect thing for me to try RAG. 

17.25: The idea with RAG is that I will take the text of the novel and I’ll chop it up into maybe 20-word increments, with five-word overlap—so the first 20 words of the book and then word 15 through 35 and then word 30 through 50 so you get those overlaps—and then store those into a vector database. And then when somebody wants to ask about something like maybe ask about a character in the novel, then the prompts will be vectorized, and the embeddings for that prompt can be compared with the embeddings of all of these chunks. 

And then when similar chunks are found, like the name of the character and stuff like that, or if the prompt asks, “Tell me about her hometown,” then there may be a chunk in the book that says, “Her hometown is blah,” you know?

So they will then be retrieved from the database and added to the prompt, and then sent to something like GPT. So now GPT has much more context: not just the prompt but also all these extra bits that it retrieves from the book that says, “Hey, she’s from this town and she likes this food.” And while ChatGPT doesn’t know about the book, it does know about the town, and it does know about that food, and it can give a more intelligent answer. 

18.34: So it’s not really a tuning of the model in any way or posttuning of the model, but it’s an interesting and really nice hack to allow you to get the model to be able to do more than you thought it could do. 
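
Here is a toy version of that chunk-and-retrieve flow, using the 20-word chunks with a 5-word overlap described above. The hashed bag-of-words embed() is a deliberately crude stand-in for a real embedding model, and an in-memory list stands in for the vector database.

```python
import numpy as np

def chunk_words(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    # 20-word chunks, each starting 15 words after the previous one.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy hashed bag-of-words vector; swap in a real embedding model here.
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    sims = []
    for chunk in chunks:
        v = embed(chunk)
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        sims.append(float(q @ v / denom) if denom else 0.0)  # cosine similarity
    top = sorted(range(len(chunks)), key=lambda i: sims[i], reverse=True)[:k]
    return [chunks[i] for i in top]

# The retrieved chunks get prepended to the prompt before calling the LLM,
# giving it context about a book it was never trained on.
```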

But going back to the question about tooling, there’s a lot of trial and error there like “How do I tokenize the words? What kind of chunk size do I use?” And all of that kind of stuff. So anybody that can provide any kind of tooling in that space so that you can try multiple databases and compare them against each other, I think is really valuable and really, really important.

19.05: If I go to the other end of the spectrum, then for actual real tuning of a model, I think LoRA tuning is a good example there. And tooling for that is hard to find. It’s few and far between. 

19.20: I think actually there are a lot of providers now where you can focus on your dataset and then. . . It’s a bit of a black box, obviously, because you’re relying on an API. I guess my point is that even if you’re [on] a team where you don’t have that expertise, you can get going. Whereas in reinforcement learning, there’s really not much tooling out there. 

19.50: Certainly with reinforcement learning, you’ve got to kind of just crack open the APIs and start coding. It’s not as difficult as it sounds, once you start doing it.

20.00: There are people who are trying to build tools, but I haven’t seen one where you can just point the domain expert at it. 

20.09: Totally. And I would also encourage [listeners that] if you’re doing any other stuff like LoRA tuning, it’s really not that difficult once you start looking. And PyTorch is great for this, and Python is great for this, once you start looking at how to do it. Shameless self-plug here, but [in] the final chapter of my PyTorch book, I actually give an example of LoRA tuning, where I created a dataset for a digital influencer and I show you how to retune and how to LoRA-tune the Stable Diffusion model to be a specialist in creating for this one particular individual—just to show how to do all of that in code.
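For readers who want to see how small the moving parts are, here’s a minimal sketch of a LoRA setup using the Hugging Face peft library, with a small GPT-2 as a stand-in base model; the rank, alpha, and target modules shown are illustrative choices, not a recipe from the book mentioned above.

```python
# Minimal LoRA setup sketch using Hugging Face peft (illustrative values).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in base model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's attention projection layer
    lora_dropout=0.05,
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices train

# From here, a normal training loop (or transformers' Trainer) fine-tunes just
# the adapters on your dataset, leaving the base model's weights frozen.
```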

Because I’m always a believer that before I start using third-party tools to do a thing, I kind of want to look at the code and the frameworks and how to do that thing for myself. So then I can really understand the value that the tools are going to be giving me. So I tend to veer towards “Let me code it first before I care about the tools.”

21.09: Spoken like a true Googler. 

21.15: [laughs] I have to call out one tool that, while it’s not specifically for fine-tuning large language models, I hope they convert it for that. But this one changed the game for me: Apple has a tool called Create ML, which was really used for transfer learning off of existing models—which is still posttraining, just not posttraining of LLMs.

And that tool’s ability to take a dataset and then fine-tune a model like a MobileNet or an object detection model on it, codelessly and efficiently, blew my mind with how good it was. The world needs more tooling like that. And if there’s any Apple people listening, I’d encourage them to extend Create ML for large language models or for any other generative models.

22.00: By the way, I want to make sure, as we wind down, I ask you about edge—that’s what’s occupying you at the moment. You talk about this notion of “build once, deploy everywhere.” So what’s actually feasible today? 

22.19: So what’s feasible today? I think the best multideployment surface today, the one I would invest in going forward, is building for ExecuTorch, because the ExecuTorch runtime is going to be living in so many places. 

At Arm, obviously we’ve been working very closely with ExecuTorch and we are part of the ExecuTorch 1.0 release. But if you’re building for edge, you know, making sure that your models work on ExecuTorch is, I think, the number one piece of low-hanging fruit that I would say people should invest in. So that’s PyTorch’s model.
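For a sense of what “building for ExecuTorch” looks like in practice, here’s a minimal export sketch following the flow in ExecuTorch’s documentation at the time of writing; the toy model and file name are illustrative, and the exact APIs may differ between versions.

```python
# Minimal sketch: export a PyTorch model to an ExecuTorch .pte program.
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x) * 2.0

model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

exported = torch.export.export(model, example_inputs)  # capture the graph
edge_program = to_edge(exported)                        # lower to the edge dialect
executorch_program = edge_program.to_executorch()       # final runtime format

with open("tiny_model.pte", "wb") as f:                 # illustrative file name
    f.write(executorch_program.buffer)
# The .pte file can then be loaded by the ExecuTorch runtime on Android, iOS,
# or other embedded targets.
```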

22.54: Does it really live up to the “run everywhere”?

23.01: Define “everywhere.”

23.02: [laughs] I guess, at the minimum, Android and iOS. 

23.12: So yes, at a minimum, for those—the same as LiteRT or TensorFlow Lite from Google does. What I’m excited about with ExecuTorch is that it also runs in other physical AI areas. We are going to be seeing it in cars and robots and other things as well. And I anticipate that that ecosystem will spread a lot faster than the LiteRT one. So if you’re starting with Android and iOS, then you’re in good shape. 

23.42: What about the kinds of devices that our mutual friend Pete Warden, for example, targets? The really compute-hungry [ones]? Well, not so much compute hungry, but basically not much compute.

24.05: They sip power rather than gulping it. I think that would be a better question for Pete than for me. If you see him, tell him I said hi. 

24.13: I mean, is that something that the ExecuTorch community also kind of thinks about?

24.22: In short, yes. At more length, it’s a bit more of a challenge to get onto microcontrollers and the like. One of the things I’m really excited about when you start getting down onto the small [devices] is a technology called SME, which is Scalable Matrix Extensions. It’s something that Arm have been working on with various chip makers and handset makers, with the idea being that SME is all about being able to run AI workloads on the CPU, without needing a separate external accelerator. And then as a result, the CPU’s going to be drawing less battery, those kinds of things, etc. 

That’s one of the growth areas that I’m excited about, where you’re going to see more and more AI workloads being able to run on handsets, particularly the diverse Android handsets, because the CPU is capable of running models instead of you needing to offload to a separate accelerator, be it an NPU or a TPU or a GPU.

And the problem with the Android ecosystem is the sheer diversity makes it difficult for a developer to target any specific one. But if more and more workloads can actually move on to the CPU, and every device has a CPU, then the idea of being able to do more and more AI workloads through SME is going to be particularly exciting.

25.46: So actually, Laurence, for people who don’t work on edge deployments, give us a sense of how capable some of these small models are. 

First I’ll throw out an unreasonable example: coding. So obviously, I and many other people love all these coding tools like Claude Code, but sometimes they really consume a lot of compute and get expensive. And not only that, you end up getting somewhat dependent, so that you have to always be connected to the cloud. So if you are on a plane, suddenly you’re not as productive anymore. 

So I’m sure in coding it might not be feasible, but what are these language models or these foundation models capable of doing locally [on smartphones, for example] that people may not be aware of?

26.47: Okay, so let me kind of answer that in two different ways: [what] device foundation models are capable of that people may not be aware of [and] the overall on-device ecosystem and the kind of things you can do that people may not be aware of. And I’m going to start with the second one.

You mentioned China earlier on. Alipay is a company from China, and they’ve been working on the SME technology that I spoke about. They had an app (I’m sure we’ve all seen these kinds of apps) where you can take your vacation photographs and then search them for things, like “Show me all the pictures I took with a panda.”

And then you can create a slideshow or a subset of your folder with that. But when you build something like that, the AI required to be able to search images for a particular thing needs to live in the cloud because on-device just wasn’t capable of doing that type of image-based searching previously.

27.47: So then as a company, they had to stand up a cloud service to be able to do this. As a user, I had privacy and latency issues if I was using this: I have to share all of my photos with a third party and whatever I’m looking for in those photos I have to share with the third party.

And then of course, there’s the latency: I have to send the query. I have to have the query execute in the cloud. I have to have the results come back to my device and then be assembled on my device. 

28.16: Now with an on-device AI, thinking about it from both the user perspective and from the app vendor perspective, it’s a better experience. I’ll start from the app vendor perspective: They don’t need to stand up this cloud service anymore, so they’re saving a lot of time and effort and money, because everything is moving on-device, with a model that’s capable of understanding images, and understanding the contents of images so that you can search for them, executing completely on-device.

The user experience is also better: “Show me all the pictures of pandas that I have,” where it’s able to search the device for those pictures, or look through all the pictures on the device, get an embedding that represents the contents of each picture, match that embedding to the query that the user is doing, and then assemble those pictures. So you don’t have the latency, and you don’t have the privacy issues, and the vendor doesn’t have to stand up stuff.

29.11: So that’s the kind of area where I’m seeing great improvements, not just in user experience but also making it much cheaper and easier for somebody to build these applications—and all of that then stems from the capabilities of foundation models that are executing on the device, right? In this case, it’s a model that’s able to turn an image into a set of embeddings so that you can search those embeddings for matching things.
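As a desktop-scale illustration of that image-to-embedding search pattern, here’s a minimal sketch using the sentence-transformers CLIP checkpoint; the photo folder is hypothetical, and an actual phone app would use a quantized model in an on-device runtime rather than this Python stack.

```python
# Minimal sketch of embedding-based photo search (desktop illustration only).
from glob import glob
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")          # encodes both images and text

photo_paths = glob("vacation_photos/*.jpg")           # hypothetical folder
photo_embeddings = model.encode([Image.open(p) for p in photo_paths])

query_embedding = model.encode("a photo of a panda")  # the user's search query
hits = util.semantic_search(query_embedding, photo_embeddings, top_k=5)[0]

for hit in hits:
    # Each hit holds the index of the matching photo and its similarity score.
    print(photo_paths[hit["corpus_id"]], round(hit["score"], 3))
```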

As a result, we’re seeing more and more on-device models, like Gemini Nano, like Apple Intelligence, becoming a foundational part of the operating system. Then more and more we’ll see applications like these being made possible. 

I can’t afford to stand up a cloud service. You know, it’s costing millions of dollars to be able to build an application for somebody, so I can’t do that. And how many small startups can’t do that? But then as it moves on-device, and you don’t need all of that, and it’s just going to be purely an on-device thing, then suddenly it becomes much more interesting. And I think there’ll be a lot more innovation happening in that space. 

30.16: You mentioned Gemma. What are the key families of local foundation models?

30.27: Sure. So, there’s local foundation models, and then also embedded on-device models. So Gemini Nano on Android and the Apple Intelligence models on Apple, as well as this ecosystem of smaller models that could work either on-device or on your desktop, like the Gemma family from Google. There’s the OpenAI gpt-oss, there’s the Qwen stuff from China, there’s Llama. You know, there’s a whole bunch of them out there.

I’ve recently been using the gpt-oss, which I find really good. And obviously I’m also a big fan of Gemma, but there’s lots of families out there—there’s so many new ones coming online every day, it seems. So there’s a lot of choice for those, but many of them are still too big to work on a mobile device.

31.15: You brought up quantization earlier on. And that’s where quantization will have to come into play, at least in some cases. But I think for the most part, if you look at where the vectors are trending, the smaller models are getting smarter. So what the 7 billion-parameter model can do today you needed 100 billion parameters to do two years ago.

And you keep projecting that forward, like the 1 billion-parameter model’s kind of [going to] be able to do the same thing in a year or two’s time, and then it becomes relatively trivial to put them onto a mobile device if they’re not part of the core operating system, but for them to be something that you ship along with your application.

I can see more and more of that happening where third-party models being small enough to work on mobile devices will become the next wave of what I’ve been calling small AI, not just on mobile but also on desktop and elsewhere. 
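Where a model is still a bit too large, post-training quantization is one of the standard levers. Here’s a minimal sketch of dynamic quantization in PyTorch with a toy stand-in model; real on-device deployments would typically quantize as part of the export path for their target runtime.

```python
# Minimal sketch of post-training dynamic quantization in PyTorch.
import torch

model = torch.nn.Sequential(        # toy stand-in for a real network
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 128),
).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # which layer types to quantize
    dtype=torch.qint8,   # store weights as 8-bit integers
)
# The quantized model needs roughly a quarter of the weight memory of float32
# and runs the quantized layers with int8 kernels on supported CPUs.
```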

32.13: So in closing, Laurence, for our listeners who are already familiar with AI and may already be building AI applications for cloud or enterprise, this conversation may prompt them to start checking out edge and local applications.

Besides your book and your blog, what are some of the key resources? Are there specific conferences where a lot of these local AI edge AI people gather, for example? 

32.48: So local AI, not yet; I think that wave is only just beginning. Obviously things like the Meta conferences will talk a lot about Llama, and Google conferences will talk a lot about Gemma, but there isn’t yet an independent conference for just general local AI as a whole.

Mobile is very vendor specific or [focused on] the ecosystem of a vendor. Apple obviously have their WWDC, Google have their conferences, but there’s also the independent conference called droidcon, which I find really, really good for understanding mobile and understanding AI on mobile, particularly for the Android ecosystem.

But as for an overall conference for small AI and for the ideas of fine-tuning, all of the types of posttuning of small AI that can be done, that’s a growth area. For posttraining, there’s a really excellent Coursera course that a friend of mine, Sharon Zhou, just released; it came out last week or the week before, and it covers all of the ins and outs of posttraining and fine-tuning. But, yeah, I think it’s a great growth area.

34.08: And for those of us who are iPhone users. . . I keep waiting for Apple Intelligence to really up its game. It seems like it’s getting close. They have multiple initiatives in the works. They have alliances with OpenAI and now with Google. But then apparently they’re also working on their own model. So any inside scoop? [laughs]

34.33: Well, no inside scoop because I don’t work at Apple or anything like that, but I’ve been using Apple Intelligence quite a lot, and I’m a big fan. The ability to have the on-device large language model is really powerful. There’s a lot of scenarios I’ve been kind of poking around with and helping some startups with in that space. 

The one thing that I would say that’s a big gotcha for developers to look out for is the very small context window. It’s only 8K, so if you try to do any kind of long-running stuff or anything interesting like that, you’ve got to go off-device. Apple have obviously been investing in this private cloud so that your sessions, when they go off-device into the cloud. . . At least they try to solve the privacy part of it. They’re getting ahead of the privacy [issue] better than anybody else, I think. 

But latency is still there. And I think that deal with Google to provide Gemini services that was announced a couple of days ago is more on that cloud side of things and less on the on-device. 

35.42: But going back to what I was saying earlier on, the 7 billion-parameter model of today is as good as the 120 billion of yesterday. The 1 billion-parameter [model] of next year is probably as good as that, if not better. So, as smaller-parameter (and therefore smaller memory footprint) models are becoming much more effective, I can see more of them being delivered on-device as part of the operating system, in the same way as Apple Intelligence is doing it, but hopefully with a bigger context window, because they can afford it with the smaller model. 

36.14: And to clarify, Laurence, that trend that you just pointed out, the increasing capability of the smaller models, that holds not just for LLMs but also for multimodal? 

36.25: Yes. 

36.26: And with that, thank you, Laurence. 

36.29: Thank you, Ben. Always a pleasure.


Real revenue, actual value, and a little froth: Read AI CEO David Shim on the emerging AI economy

15 November 2025 at 10:30
Read AI CEO David Shim discusses the state of the AI economy in a conversation with GeekWire co-founder John Cook during a recent Accenture dinner event for the “Agents of Transformation” series. (GeekWire Photo / Holly Grambihler)

[Editor’s Note: Agents of Transformation is an independent GeekWire series and 2026 event, underwritten by Accenture, exploring the people, companies, and ideas behind the rise of AI agents.]

What separates the dot-com bubble from today’s AI boom? For serial entrepreneur David Shim, it’s two things the early internet never had at scale: real business models and customers willing to pay.

People used the early internet because it was free and subsidized by incentives like gift certificates and free shipping. Today, he said, companies and consumers are paying real money and finding actual value in AI tools that are scaling to tens of millions in revenue within months.

But the Read AI co-founder and CEO, who has built and led companies through multiple tech cycles over the past 25 years, doesn’t dismiss the notion of an AI bubble entirely. Shim pointed to the speculative “edges” of the industry, where some companies are securing massive valuations despite having no product and no revenue — a phenomenon he described as “100% bubbly.”

He also cited AMD’s deal with OpenAI — in which the chipmaker offered stock incentives tied to a large chip purchase — as another example of froth at the margins. The arrangement had “a little bit” of a 2000-era feel of trading, bartering and unusual financial engineering that briefly boosted AMD’s stock.

But even that, in his view, is more of an outlier than a systemic warning sign.

“I think it’s a bubble, but I don’t think it’s going to burst anytime soon,” Shim said. “And so I think it’s going to be more of a slow release at the end of the day.”

Shim, who was named CEO of the Year at this year’s GeekWire Awards, previously led Foursquare and sold the startup Placed to Snap. He now leads Read AI, which has raised more than $80 million and landed major enterprise customers for its cross-platform AI meeting assistant and productivity tools.

He made the comments during a wide-ranging interview with GeekWire co-founder John Cook. They spoke about AI, productivity, and the future of work at a recent dinner event hosted in partnership with Accenture, in conjunction with GeekWire’s new “Agents of Transformation” editorial series.

We’re featuring the discussion on this episode of the GeekWire Podcast. Listen above, and subscribe to GeekWire in Apple Podcasts, Spotify, or wherever you listen. Continue reading for more takeaways.

Successful AI agents solve specific problems: The most effective AI implementations will be invisible infrastructure focused on particular tasks, not broad all-purpose assistants. The term “agents” itself will fade into the background as the technology matures and becomes more integrated.

Human psychology is shaping AI deployment: Internally, Read AI is testing an AI assistant named “Ada” that schedules meetings by learning users’ communication patterns and priorities. It works so quickly, he said, that Read AI is building delays into its responses, after finding that quick replies “freak people out,” making them think their messages didn’t get a careful read.

Global adoption is happening without traditional localization: Read AI captured 1% of Colombia’s population without local staff or employees, demonstrating AI’s ability to scale internationally in ways previous technologies couldn’t.

“Multiplayer AI” will unlock more value: Shim says an AI’s value is limited when it only knows one person’s data. He believes one key is connecting AI across entire teams, to answer questions by pulling information from a colleague’s work, including meetings you didn’t attend and files you’ve never seen.

“Digital Twins” are the next, controversial frontier: Shim predicts a future in which a departed employee can be “resurrected” from their work data, allowing companies to query that person’s institutional knowledge. The idea sounds controversial and “a little bit scary,” he said, but it could be invaluable for answering questions that only the former employee would have known.

Subscribe to GeekWire in Apple Podcasts, Spotify, or wherever you listen.

Hackaday Podcast Episode 345: A Stunning Lightsaber, Two Extreme Cameras, and Wrangling Roombas

14 November 2025 at 12:00

It’s a wet November evening across Western Europe, the steel-grey clouds have obscured a rare low-latitude aurora this week, and Elliot Williams is joined by Jenny List for this week’s podcast. And we’ve got a fine selection for your listening pleasure!

The 2025 Component Abuse Challenge has come to an end, so this week you’ll be hearing about a few of the entries. We’ve received an impressive number, and as always we’re bowled over by the ingenuity of Hackaday readers in pushing parts beyond their limits.

In the news is the potential discovery of a lost UNIX version in a dusty store room at the University of Utah: Version 4 of the OS, which appeared in 1973. Check out your own stores for hidden nuggets of gold. In the hacks, we have two cameras at opposite ends of the resolution spectrum, but sharing some impressive reverse engineering. Mouse cameras and scanner cameras were both a thing a couple of decades ago, and it’s great to see people still pushing the boundaries. Then we look at the challenge of encoding Chinese text as Morse code, an online-upgraded multimeter, the art of making lenses for an LED lighting effect, and what must be the best recreation of a Star Wars light sabre we have ever seen. In quick hacks we have a bevy of Component Abuse Challenge projects, a Minecraft server on a smart light bulb, and a long-term test of smartphone battery charging techniques.

We round off with a couple of our long-form pieces, first the uncertainties about iRobot’s future and what it might mean for their ecosystem — think: cheap hackable robotics platform! — and then a look at FreeBSD as an alternative upgrade path for Windows users. It’s a path not without challenges, but the venerable OS still has plenty to give.

As always, you can listen using the links below, and we’ve laid out links to all the articles under discussion at the bottom of the page.

Download our finest MP3 right here.

Where to Follow Hackaday Podcast

Episode 345 Show Notes:

News:

What’s that Sound?

Interesting Hacks of the Week:

Quick Hacks:

Can’t-Miss Articles:

Generative AI in the Real World: Chris Butler on GenAI in Product Management

30 October 2025 at 07:29

In this episode, Ben Lorica and Chris Butler, director of product operations for GitHub’s Synapse team, chat about the experimentation Chris is doing to incorporate generative AI into the product development process—particularly with the goal of reducing toil for cross-functional teams. It isn’t just automating busywork (although there’s some of that). He and his team have created agents that expose the right information at the right time, use feedback in meetings to develop “straw man” prototypes for the team to react to, and even offer critiques from specific perspectives (a CPO agent?). Very interesting stuff.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: Today we have Chris Butler of GitHub, where he leads a team called Synapse. Welcome to the podcast, Chris. 

00.15: Thank you. Yeah. Synapse is actually part of our product team and what we call EPD operations, which is engineering, product, and design. And our team is mostly engineers. I’m the product lead for it, but we help solve and reduce toil for these cross-functional teams inside of GitHub, mostly building internal tooling, with the focus on process automation and AI. But we also have a speculative part of our practice as well: trying to imagine the future of cross-functional teams working together and how they might do that with agents, for example.

00.45: Actually, you are the first person I’ve come across who’s used the word “toil.” Usually “tedium” is what people use, in terms of describing the parts of their job that they would rather automate. So you’re actually a big proponent of talking about agents that go beyond coding agents.

01.03: Yeah. That’s right. 

01.05: And specifically in your context for product people. 

01.09: And actually, for just the way that, say, product people work with their cross-functional teams. But I would also include other types of functions: legal, privacy, customer support, docs. For any of these people that are working to actually help build a product, I think there needs to be a transformation of the way we think about these tools.

01.29: GitHub is a very engineering-led organization as well as a very engineering-focused organization. But my role is to really think about “How do we do a better job between all these people that I would call nontechnical—but they are sometimes technical, of course, but the people that are not necessarily there to write code. . . How do we actually work together to build great products?” And so that’s what I think about at work. 

01.48: For people who aren’t familiar with product management and product teams, what’s toil in the context of product teams? 

02.00: So toil is actually something that I stole from a Google SRE, from the standpoint of any type of thing that someone has to do that is manual, tactical, repetitive. . . It usually doesn’t really add to the value of the product in any way. It’s something that, as the team gets bigger or the product goes down the SDLC or lifecycle, scales linearly with the fact that you’re building bigger and bigger things. And so it’s usually something that we want to try to cut out, because not only is it potentially a waste of time, but there’s also a perception within the team that it can cause burnout.

02.35: If I have to constantly be doing toilsome parts of my work, I feel I’m doing things that don’t really matter rather than focusing on the things that really matter. And what I would argue is especially for product managers and cross-functional teams, a lot of the time that is processes that they have to use, usually to share information within larger organizations.

02.54: A good example of that is status reporting. Status reporting is one of those things where people will spend anywhere from 30 minutes to hours per week. And sometimes it’s in certain parts of the team—technical product managers, product managers, engineering managers, program managers are all dealing with this aspect that they have to in some way summarize the work that the team is doing and then shar[e] that not only with their leadership. . . They want to build trust with their leadership, that they’re making the right decisions, that they’re making the right calls. They’re able to escalate when they need help. But also then to convey information to other teams that are dependent on them or they’re dependent on. Again, this is [in] very large organizations, [where] there’s a huge cost to communication flows.

03.35: And so that’s why I use status reporting as a good example of that. Now with the use of the things like LLMs, especially if we think about our LLMs as a compression engine or a translation engine, we can then start to use these tools inside of these processes around status reporting to make it less toilsome. But there’s still aspects of it that we want to keep that are really about humans understanding, making decisions, things like that. 

03.59: And this is key. So one of the concerns that people have is about a hollowing out in the following context: If you eliminate toil in general, the problem there is that your most junior or entry-level employees actually learn about the culture of the organization by doing toil. There’s some level of toil that becomes part of the onboarding in the acculturation of young employees. But on the other hand, this is a challenge for organizations to just change how they onboard new employees and what kinds of tasks they give them and how they learn more about the culture of the organization.

04.51: I would differentiate between the idea of toil and paying your dues within the organization. In investment banking, there’s a whole concern about that: “They just need to sit in the office for 12 hours a day to really get the culture here.” And I would differentiate that from. . .

05.04: Or “Get this slide to pitch decks and make sure all the fonts are the right fonts.”

05.11: That’s right. Yeah, I worked at Facebook Reality Labs, and there were many times where we would do a Zuck review, and getting those slides perfect was a huge task for the team. What I would say is I want to differentiate this from the gaining of expertise. So if we think about Gary Klein, naturalistic decision making, real expertise is actually about being able to see an environment. And that could be a data environment [or] information environment as well. And then as you gain expertise, you’re able to discern between important signals and noise. And so what I’m not advocating for is to remove the ability to gain that expertise. But I am saying that toilsome work doesn’t necessarily contribute to expertise. 

05.49: In the case of status reporting as an example—status reporting is very valuable for a person to be able to understand what is going on with the team, and then, “What actions do I need to take?” And we don’t want to remove that. But the idea that a TPM or product manager or EM has to dig through all of the different issues that are inside of a particular repo to look for specific updates and then do their own synthesis of a draft, I think there is a difference there. And so what I would say is that the idea of me reading this information in a way that is very convenient for me to consume and then to be able to shape the signal that I then put out into the organization as a status report, that is still very much a human decision.

06.30: And I think that’s where we can start to use tools. Ethan Mollick has talked about this a lot in the way that he’s trying to approach including LLMs in, say, the classroom. There’s two patterns that I think could come out of this. One is that when I have some type of early draft of something, I should be able to get a lot of early feedback that is very low reputational risk. And what I mean by that is that a bot can tell me “Hey, this is not written in a way with the active voice” or “[This] is not really talking about the impact of this on the organization.” And so I can get that super early feedback in a way that is not going to hurt me.

If I publish a really bad status report, people may think less of me inside the organization. But using a bot or an agent or just a prompt to even just say, “Hey, these are the ways you can improve this”—that type of early feedback is really, really valuable. That I have a draft and I get critique from a bunch of different viewpoints I think is super valuable and will build expertise.
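As one concrete way to picture that low-stakes critique loop, here’s a minimal sketch using the OpenAI Python client as a stand-in for whatever model endpoint a team actually has access to; the rubric wording and model name are illustrative, not GitHub’s internal setup.

```python
# Minimal sketch of the "early, low-reputational-risk feedback" pattern.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

RUBRIC = (
    "You review draft status reports. Point out passive voice, missing impact "
    "statements, unclear asks for help, and anything a leader could not act on."
)

def critique_status_report(draft: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": draft},
        ],
    )
    return response.choices[0].message.content

print(critique_status_report("We did some stuff on the search project this week."))
```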

07.24: And then there’s the other side, which is, when we talk about consuming lots of information and then synthesizing or translating it into a draft, I can then critique “Is this actually valuable to the way that I think that this leader thinks? Or what I’m trying to convey as an impact?” And so then I am critiquing the straw man that is output by these prompts and agents.

07.46: Those two different patterns together actually create a really great loop for me to be able to learn not only from agents but also from the standpoint of seeing how. . . The part that ends up being really exciting is once you start to connect the way communication happens inside the organization, I can then see what my leaders passed on to the next leader or what this person interpreted this as. And I can use that as a feedback loop to then improve, over time, my expertise in, say, writing a status report that is shaped for the leader. There’s also a whole thing that when we talk about status reporting in particular, there is a difference in expertise that people are getting that I’m not always 100%. . .

08.21: It’s valuable for me to understand how my leader thinks and makes decisions. I think that is very valuable. But the idea that I will spend hours and hours shaping and formulating a status report from my point of view for someone else can be aided by these types of systems. And so status should not be at the speaker’s mouth; it should be at the listener’s ear.

For these leaders, they want to be able to understand “Are the teams making the right decisions? Do I trust them? And then where should I preemptively intervene because of my experience or maybe my understanding of the context in the broader organization?” And so that’s what I would say: These tools are very valuable in helping build that expertise.

09.00: It’s just that we have to rethink “What is expertise?” And I just don’t buy it that paying your dues is the way you gain expertise. You do sometimes. Absolutely. But a lot of it is also just busy work and toil. 

09.11: My thing is these are productivity tools. And so you make even your junior employees productive—you just change the way you use your more-junior employees. 

09.24: Maybe just one thing to add to this is that there is something really interesting inside of the education world of using LLMs: trying to understand where someone is at. The type of feedback for someone that is very early in their career, or doing something for the first time, is potentially very different, in the way that you’re teaching them or giving them feedback, from the feedback for someone that is much further along in expertise, who just wants to get down to “What are some things I’m missing here? Where am I biased?” Those are things where I think we also need to do a better job for those early employees, the people that are just starting to gain expertise: “How do we train them using these tools as well as other ways?”

10.01: And I’ve done that as well. I do a lot of learning and development help, internal to companies, and I did that as part of the PM faculty for learning and development at Google. And so thinking a lot about how PMs gain expertise, I think we’re doing a real disservice by making it so that product manager as a junior position is so hard to get.

10.18: I think it’s really bad because, right out of college, I started doing program management, and it taught me so much about this. But at Microsoft, when I joined, we would say that the program manager wasn’t really worth very much for the first two years, right? Because they’re gaining expertise in this.

And so I think LLMs can help give the ability for people to gain expertise faster and also help them avoid making errors that other people might make. But I think there’s a lot to do with just learning and development in general that we need to pair with LLMs and human systems.

10.52: In terms of agents, I guess agents for product management, first of all, do they exist? And if they do, I always like to look at what level of autonomy they really have. Most agents really are still partially autonomous, right? There’s still a human in the loop. And so the question is “How much is the human in the loop?” It’s kind of like a self-driving car. There’s driver assists, and then there’s all the way to self-driving. A lot of the agents right now are “driver assist.” 

11.28: I think you’re right. That’s why I don’t always use the term “agent,” because it’s not an autonomous system that is storing memory using tools, constantly operating.

I would argue though that there is no such thing as “human out of the loop.” We’re probably just drawing the system diagram wrong if we’re saying that there’s no human that’s involved in some way. That’s the first thing. 

11.53: The second thing I’d say is that I think you’re right. A lot of the time right now, it ends up being when the human needs the help. We end up creating systems inside of GitHub; we have something that’s called Copilot Spaces, which is really a custom GPT. It’s really just a bundling of context that I can then go to when I need help with a particular type of thing. We built very highly specific types of copilot spaces, like “I need to write a blog announcement about something. And so what’s the GitHub writing style? How should I be wording this, avoiding jargon?” Internal things like that. So it can be highly specific. 

We also have more general tools that are kind of like “How do I form and maintain initiatives throughout the entire software development lifecycle? When do I need certain types of feedback? When do I need to generate the 12 to 14 different documents that compliance and downstream teams need?” And so those tend to be operating in the background to autodraft these things based on the context that’s available. And so that’s I’d say that’s semiagentic, to a certain extent. 

12.52: But I think actually there’s really big opportunities when it comes to. . . One of the cases that we’re working on right now is actually linking information in the GitHub graph that is not commonly linked. And so a key example of that might be kicking off all of the process that goes along with doing a release. 

When I first get started, I actually want to know in our customer feedback repo, in all the different places where we store customer feedback, “Where are there times that customers actually asked about this or complained about it or had some information about this?” And so when I get started, being able to automatically link something like a release tracking issue with all of this customer feedback becomes really valuable. But it’s very hard for me as an individual to do that. And what we really want—and what we’re building—[are] things that are more and more autonomous about constantly searching for feedback or information that we can then connect to this release tracking issue.

13.44: So that’s why I say we’re starting to get into the autonomous realm when it comes to this idea of something going around looking for linkages that don’t exist today. And so that’s one of those things, because again, we’re talking about information flow. And a lot of the time, especially in organizations the size of GitHub, there’s lots of siloing that takes place.

We have lots of repos. We have lots of information. And so it’s really hard for a single person to ever keep all of that in their head and to know where to go, and so [we’re] bringing all of that into the tools that they end up using. 

14.14: So for example, we’ve also created internal things—these are more assist-type use cases—but the idea of a Gemini Gem inside of a Google doc or an M365 agent inside of Word that is then also connected to the GitHub graph in some way. I think the question of “When do we expose this information? Is it always happening in the background, or is it only when I’m drafting the next version of this initiative?” ends up becoming really, really important.

14.41: Some of the work we’ve been experimenting with is actually “How do we start to include agents inside of the synchronous meetings that we actually do?” You probably don’t want an agent to suddenly start speaking, especially because there’s lots of different agents that you may want to have in a meeting.

We don’t have a designer on our team, so I actually end up using an agent that is prompted to be like a designer and think like a designer inside of these meetings. And so we probably don’t want them to speak up dynamically inside the meeting, but we do want them to add information if it’s helpful. 

We want to autoprototype things as a straw man for us to be able to react to. We want to start to use our planning agents and stuff like that to help us plan out “What is the work that might need to take place?” It’s a lot of experimentation about “How do we actually pull things into the places that humans are doing the work?”—which is usually synchronous meetings, some types of asynchronous communication like Teams or Slack, things like that.

15.32: So that’s where I’d say the full possibility [is] for, say, a PM. And our customers are also TPMs and leaders and people like that. It really has to do with “How are we linking synchronous and asynchronous conversations with all of this information that is out there in the ecosystem of our organization that we don’t know about yet, or viewpoints that we don’t have that we need to have in this conversation?”

15.55: You mentioned the notion of a design agent passively in the background, attending a meeting. This is fascinating. So this design agent, what is it? Is it a fine-tuned agent or. . .? What exactly makes it a design agent? 

16.13: In this particular case, it’s a specific prompt that defines what a designer would usually do in a cross-functional team and what they might ask questions about, what they would want clarification of. . .

16.26: Completely reliant on the pretrained foundation model—no posttraining, no RAG, nothing? 

16.32: No, no. [Everything is in the prompt] at this point. 

16.36: How big is this prompt? 

16.37: It’s not that big. I’d say it’s maybe at most 50 lines, something like that. It’s pretty small. The truth is, the idea of a designer is something that LLMs know about. But more for our specific case, right now it’s really just based on this live conversation. And there’s a lot of papercuts in the way that we have to do a site call, pull a live transcript, put it into a space, and [then] I have a bunch of different agents that are inside the space that will then pipe up when they have something interesting to say, essentially.

And it’s a little weird because I have to share my screen and people have to read it, hold the meeting. So it’s clunky right now in the way that we bring this in. But what it will bring up is “Hey, these are patterns inside of design that you may want to think about.” Or you know, “For this particular part of the experience, it’s still pretty ambiguous. Do you want to define more about what this part of the process is?” And we’ve also included legal, privacy, data-oriented groups. Even the idea of a facilitator agent saying that we were getting off track or we have these other things to discuss, that type of stuff. So again, these are really rudimentary right now.
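To make the persona-agent idea concrete, here’s a minimal sketch of prompting several narrowly scoped viewpoints over a meeting transcript, again using the OpenAI Python client as a stand-in; the persona wording and model name are illustrative, not the prompts Chris’s team uses.

```python
# Minimal sketch of persona agents reviewing a live meeting transcript.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

PERSONAS = {
    "designer": "You are a product designer. Flag ambiguous parts of the "
                "experience and suggest relevant design patterns.",
    "privacy": "You are a privacy reviewer. Flag any data collection or "
               "sharing that needs a closer look.",
    "facilitator": "You are a meeting facilitator. Note when the discussion "
                   "drifts from the agenda or rat-holes on one topic.",
}

def persona_feedback(transcript: str) -> dict[str, str]:
    feedback = {}
    for name, persona in PERSONAS.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[
                {"role": "system", "content": persona},
                {"role": "user", "content": f"Meeting transcript so far:\n{transcript}"},
            ],
        )
        feedback[name] = response.choices[0].message.content
    return feedback
```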

17.37: Now, what I could imagine though is, we have a design system inside of GitHub. How might we start to use that design system and use internal prototyping tools to autogenerate possibilities for what we’re talking about? And I guess when I think about using prototyping as a PM, I don’t think the PMs should be vibe coding everything.

I don’t think the prototype replaces a lot of the cross-functional documents that we have today. But I think what it does increase is that if we have been talking about a feature for about 30 minutes, that is a lot of interesting context that if we can say, “Autogenerate three different prototypes that are coming from slightly different directions, slightly different places that we might integrate inside of our current product,” I think what it does is it gives us, again, that straw man for us to be able to critique, which will then uncover additional assumptions, additional values, additional principles that we maybe haven’t written down somewhere else.

18.32: And so I see that as super valuable. And that’s the thing that we end up doing—we’ll use an internal product for prototyping to just take that and then have it autogenerated. It takes a little while right now, you know, a couple minutes to do a prototype generation. And so in those cases we’ll just [say], “Here’s what we thought about so far. Just give us a prototype.” And again it doesn’t always do the right thing, but at least it gives us something to now talk about because it’s more real now. It is not the thing that we end up implementing, but it is the thing that we end up talking about. 

18.59: By the way, with this notion of an agent attending some synchronous meeting, you can imagine taking it to the next level, which is to take advantage of multimodal models. The agent can then absorb speech and maybe visual cues, so then basically when the agent suggests something and someone reacts with a frown. . . 

19.25: I think there’s something really interesting about that. And when you talk about multimodal, I do think that one of the things that is really important about human communication is the way that we pick up cues from each other—if we think about it, the reason why we actually talk to each other. . . And there’s a great book called The Enigma of Reason that’s all about this.

But their hypothesis is that, yes, we can try to logic or pretend to logic inside of our own heads, but we actually do a lot of post hoc analysis. So we come up with an idea inside our head. We have some certainty around it, some intuition, and then we fit it to why we thought about this. So that’s what we do internally. 

But when you and I are talking, I’m actually trying to read your mind in some way. I’m trying to understand the norms that are at play. And I’m using your facial expression. I’m using your tone of voice. I’m using what you’re saying—actually way less of what you’re saying and more your facial expression and your tone of voice—to determine what’s going on.

20.16: And so I think this idea of engagement with these tools and the way these tools work, I think [of] the idea of gaze tracking: What are people looking at? What are people talking about? How are people reacting to this? And then I think this is where in the future, in some of the early prototypes we built internally for what the synchronous meeting would look like, we have it where the agent is raising its hand and saying, “Here’s an issue that we may want to discuss.” If the people want to discuss it, they can discuss it, or they can ignore it. 

20.41: Longer term, we have to start to think about how agents are fitting into the turn-taking of conversation with the rest of the group. And using all of these multimodal cues ends up being very interesting, because you wouldn’t want just an agent whenever it thinks of something to just blurt it out.

20.59: And so there’s a lot of work to do here, but I think there’s something really exciting about just using engagement as a means to understand what the hot topics are, but also trying to help detect “Are we rat-holing on something that should be put in the parking lot?” Those are things and cues that we can start to get from these systems as well.

21.16: By the way, context has multiple dimensions. So you can imagine in a meeting between the two of us, you outrank me. You’re my manager. But then it turns out the agent realizes, “Well, actually, looking through the data in the company, Ben knows more about this topic than Chris. So maybe when I start absorbing their input, I should weigh Ben’s, even though in the org chart Chris outranks Ben.” 

21.46: A related story: one of the things I’ve created inside of a copilot space is actually a proxy for our CPO. And so what I’ve done is I’ve taken meetings that he’s done where he asked questions in a smaller setting, taken his writing samples and things like that, and I’ve tried to turn it into, not really an agent, but a space where I can say, “Here’s what I’m thinking about for this plan. And what would Mario [Rodriguez] potentially think about this?” 

It’s definitely not 100% accurate in any way. Mario’s an individual that is constantly changing and is learning and has intuitions that he doesn’t say out loud, but it is interesting how it does sound like him. It does seem to focus on questions that he would bring up in a previous meeting based on the context that we provided. And so I think, to your point, there are a lot of things said inside of meetings right now that we then don’t use to actually help understand people’s points of view in a deeper way.

22.40: You could imagine that this proxy also could be used for [determining] potential blind spots for Mario that, as a person that is working on this, I may need to deal with, in the sense that maybe he’s not always focused on this type of issue, but I think it’s a really big deal. So how do I help him actually understand what’s going on?

22.57: And this gets back to that reporting: Is that the listener’s ear? What does that person actually care about? What do they need to know about to build trust with the team? What do they need to take action on? Those are things that I think we can start to build interesting profiles. 

There’s a really interesting ethical question, which is: Should that person be able to write their own proxy? Would it include the blind spots that they have or not? And then maybe compare this to—you know, there’s [been] a trend for a little while where every leader would write their own user manual or readme, and inside of those things, they tend to be a bit more performative. It’s more about how they idealize their behavior versus the way that they actually are.

23.37: And so there’s some interesting problems that start to come up when we’re doing proxying. I don’t call it a digital twin of a person, because digital twins to me are basically simulations of mechanical things. But to me it’s “What is this proxy that might sit in this meeting to help give us a perspective and maybe even identify when this is something we should escalate to that person?”

23.55: I think there’s lots of very interesting things. Power structures inside of the organization are really hard to discern because there’s both, to your point, hierarchical ones that are very set in the systems that are there, but there’s also unsaid ones. 

I mean, one funny story is Ray Dalio did try to implement this inside of his hedge fund. And unfortunately, I guess, for him, there were two people that were considered to be higher ranking in reputation than him. But then he changed the system so that he was ranked number one. So I guess we have to worry about this type of thing for these proxies as well. 

24.27: One of the reasons why coding is such a great playground for these things is one, you can validate the result. But secondly, the data is quite tame and relatively right. So you have version control systems like GitHub—you can look through that and say, “Hey, actually Ben’s commits are much more valuable than Chris’s commits.” Or “Ben is the one who suggested all of these changes before, and they were all accepted. So maybe we should really take Ben’s opinion much more strong[ly].” I don’t know what artifacts you have in the product management space that can help develop this reputation score.

25.09: Yeah. It’s tough because a reputation score, especially once you start to monitor some type of metric and it becomes the goal, that’s where we get into problems. For example, Agile teams adopting velocity as a metric: It’s meant to be an internal metric that helps us understand “If this person is out, how does that adjust what type of work we need to do?” But then comparing velocities between different teams ends up creating a whole can of worms around “Is this actually the metric that we’re trying to optimize for?”

25.37: And even when it comes to product management, what I would say is actually valuable a lot of the time is “Does the team understand why they’re working on something? How does it link to the broader strategy? How does this solve both business and customer needs? And then how are we wrangling this uncertainty of the world?” 

I would argue that a really key meta skill for product managers—and for other people like generative user researchers, business development people, you know, even leaders inside the organization—is that they have to deal with a lot of uncertainty. And it’s not that we need to shut down the uncertainty, because uncertainty is actually something we should take advantage of and use in some way. But there are places where we need to be able to build enough certainty for the team to do their work and then make plans that are resilient in the face of future uncertainty. 

26.24: And then finally, the ability to communicate what the team is doing and why it’s important is very valuable. Unfortunately, there’s not a lot of. . . Maybe there’s rubrics we can build. And that’s actually what career ladders try to do for product managers. But they tend to be very vague actually. And as you get more senior inside of a product manager organization, you start to see things—it’s really just broader views, more complexity. That’s really what we start to judge product managers on. Because of that fact, it’s really about “How are you working across the team?”

26.55: There will be cases, though, that we can start to say, “Is this thing thought out well enough at first, at least for the team to be able to take action?” And then linking that work as a team to outcomes ends up being something that we can apply more and more data rigor to. But I worry about it being “This initiative brief was perfect, and so that meant the success of the product,” when the reality was that was maybe the starting point, but there was all this other stuff that the product manager and the team was doing together. So I’m always wary of that. And that’s where performance management for PMs is actually pretty hard: where you have to base most of your understanding on how they work with the other teammates inside their team.

27.35: You’ve been in product for a long time, so you have a network of peers at other companies, right? What are one or two examples of the use of AI—not in GitHub—in the product management context that you admire? 

27.53: For a lot of the people that I know that are inside of startups that are basically using prototyping tools to build out their initial product, I have a lot of, not necessarily envy, but I respect that a lot because you have to be so scrappy inside of a startup, and you’re really there to not only prove something to a customer, or actually not even prove something, but get validation from customers that you’re building the right thing. And so I think that type of rapid prototyping is something that is super valuable for that stage of an organization.

28.26: When I start to then look at larger enterprises, what I do see that I think is not as well a help with these prototyping tools is what we’ll call brownfield development: We need to build something on top of this other thing. It’s actually hard to use these tools today to imagine new things inside of a current ecosystem or a current design system.

28.46: [For] a lot of the teams that are in other places, it really is a struggle to get access to some of these tools. The thing that’s holding back the biggest enterprises from actually doing interesting work in this area is they’re overconstraining what their engineers [and] product managers can use as far as these tools.

And so what’s actually being created is shadow systems, where the person is using their personal ChatGPT to actually do the work rather than something that’s within the compliance of the organization.

29.18: Which is great for IP protection. 

29.19: Exactly! That’s the problem, right? Some of this stuff, you do want to use the most current tools. Because there is actually not just [the] time savings aspect and toil reduction aspects—there’s also just the fact that it helps you think differently, especially if you’re an expert in your domain. It really aids you in becoming even better at what you’re doing. And then it also shores up some of your weaknesses. Those are the things that really expert people are using these types of tools for. But in the end, it comes down to a combination of legal, HR, and IT, and budgetary types of things too, that are holding back some of these organizations.

30.00: When I’m talking to other people inside of the orgs. . . Maybe another problem for enterprises right now is that a lot of these tools require lots of different context. We’ve benefited inside of GitHub in that a lot of our context is inside the GitHub graph, so Copilot can access it and use it. But other teams keep things in all of these individual vendor platforms.

And so the biggest problem then ends up being “How do we merge these different pieces of context in a way that is allowed?” When I first started working on the Synapse team, I looked at the patterns that we were building and it was like “If we just had access to Zapier or Relay or something like that, that is exactly what we need right now.” Except we would not have any of the approvals for the connectors to all of these different systems. And so Airtable is a great example of something like that too: They’re building out process automation platforms that focus on data as well as connecting to other data sources, plus the idea of including LLMs as components inside these processes.

30.58: A really big issue I see for enterprises in general is the connectivity issue between all the datasets. And there are, of course, teams that are working on this—Glean or others that are trying to be more of an overall data copilot frontend for your entire enterprise datasets. But I just haven’t seen as much success in getting all these connected. 

31.17: I think one of the things that people don’t realize is enterprise search is not turnkey. You have to get in there and really do all these integrations. There are no shortcuts. There’s no vendor who can come to you and say, “Yeah, just use our system, and it all magically works.”

31.37: This is why we need to hire more people with degrees in library science, because they actually know how to manage these types of systems. Again, I first cut my teeth on this in very early versions of SharePoint a long time ago. And even inside there, there’s so much that you need to do just to help people with not only organization of the data but even just the search itself.

It’s not just a search index problem. It’s a bunch of different things. And whenever we’re shown an empty text box, there’s so much work that goes on just behind that; inside of Google, with all of the instant answers, there are lots of different ways that a particular search query is actually looked at, not just to go against the search index but to provide you the right information. And now they’re trying to include Gemini by default in there. The same thing happens within any copilot. There’s a million different things you could use. 

32.27: And so I guess maybe this gets to my hypothesis about the way that agents will be valuable, either fully autonomous ones or ones that are attached to a particular process: having many different agents that are highly biased in a particular way. And I use the term bias as in bias can be good, neutral, or bad, right? I don’t mean bias in the sense of unfairness and that type of stuff; I mean more from the standpoint of “This agent is meant to represent this viewpoint, and it’s going to give you feedback from this viewpoint.” That ends up becoming really, really valuable because of the fact that you will not always be thinking about everything. 

33.00: I’ve done a lot of work in adversarial thinking and red teaming and stuff like that. One of the things that is most valuable is to build prompts that break the sycophancy these different models exhibit by default, because it should be about challenging my thinking rather than just agreeing with it.

And then the standpoint of each one of these highly biased agents actually helps provide a very interesting approach. I mean, if we go to things like meeting facilitation or workshop facilitation groups, this is why. . . I don’t know if you’re familiar with the six hats, but the six hats is a technique by which we declare inside of a meeting that I’m going to be the one that’s all positivity. This person’s going to be the one about data. This person’s gonna be the one that’s the adversarial, negative one, etc., etc. When you have all of these different viewpoints, because of the tensions in the discussion of those ideas, the creation of options, and the weighing of options, I think you end up making much better decisions. That’s where I think those highly biased viewpoints end up becoming really valuable. 
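
To make that concrete, here is a minimal sketch of the idea, assuming a hypothetical call_llm helper standing in for whatever chat-completion client you use; the personas and their wording are illustrative, not a recommended set.

```python
# A minimal sketch of "six hats"-style feedback from deliberately biased agents.
# `call_llm` is a placeholder for your chat-completion client; the personas
# below are illustrative examples, not a canonical list.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model of choice and return its reply."""
    raise NotImplementedError

PERSONAS = {
    "optimist": "You look only for strengths and opportunities in the proposal.",
    "data_critic": "You question every claim that lacks evidence or a metric.",
    "adversary": "You challenge the idea directly; do not agree just to be agreeable.",
    "customer": "You react purely from the target user's point of view.",
}

def gather_feedback(proposal: str) -> dict[str, str]:
    """Collect one review of the same proposal from each biased viewpoint."""
    feedback = {}
    for name, stance in PERSONAS.items():
        prompt = (
            f"{stance}\n\n"
            "Review the following proposal from that viewpoint only, and list "
            f"the three most important points you would raise.\n\n{proposal}"
        )
        feedback[name] = call_llm(prompt)
    return feedback
```

Reading the resulting reviews side by side reproduces some of the tension the six-hats exercise is designed to create.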

34.00: For product people who are early in their career or want to enter the field, what are some resources that they should be looking at in terms of leveling up on the use of AI in this context?

34.17: The first thing is there are millions of prompt libraries out there for product managers. What you should do is when you are creating work, you should be using a lot of these prompts to give you feedback, and you can actually even write your own, if you want to. But I would say there’s lots of material out there for “I need to write this thing.”

One way is “I try to write it and then I get critique.” The other is “How might this AI system, through a prompt, generate a draft of this thing?” And then I go in and look at it and say, “Which things are not actually quite right here?” And I think that, again, those two patterns of getting critique and giving critique end up building a lot of expertise.
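
As a rough sketch of those two patterns, again assuming a hypothetical call_llm stand-in for your chat client and prompt wording that is only an example:

```python
# The two critique patterns described above, sketched with a placeholder client.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model and return its reply."""
    raise NotImplementedError

def critique_my_draft(draft: str) -> str:
    """Pattern 1: I write the document; the model critiques it."""
    return call_llm(
        "You are a skeptical senior product leader. Critique this draft: call out "
        "vague goals, missing metrics, and unstated assumptions.\n\n" + draft
    )

def draft_for_my_review(brief: str) -> str:
    """Pattern 2: the model drafts; I review what isn't quite right."""
    return call_llm(
        "Draft a one-page document from this brief, and flag any assumption you "
        "had to make so a human reviewer can check it.\n\n" + brief
    )
```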

34.55: I think also within the organization itself, I believe an awful lot in what is basically “learning from your peers.” Being able to join small groups where you are getting feedback from your peers, and including AI agent feedback inside of those small peer groups, is very valuable. 

There’s another technique, which is using case studies. As part of my learning development practice, I actually do something called “decision-forcing cases,” where we take a story that actually happened, walk people through it, and ask them, “What do you think is happening? What would you do next?” And when you do those types of things across junior and senior people, the junior people can start to actually learn the expertise from the senior people through these types of case studies.

35.37: I think there’s an awful lot more that senior leaders inside the organization should be doing. And as junior people inside your organization, you should be going to these senior leaders and saying, “How do you think about this? What is the way that you make these decisions?” Because what you’re actually pulling from is their past experience and expertise that they’ve gained to build that intuition.

35.53: There’s all sorts of surveys of programmers and engineers and AI. Are there surveys about product managers? Are they freaked out or what? What’s the state of adoption and this kind of thing? 

36.00: Almost every PM that I’ve met has used an LLM in some way, to help them with their writing in particular. And if you look at the studies by OpenAI about the use of ChatGPT, a lot of the writing tasks end up being from a product manager or senior leader standpoint. I think people are freaked out because every practice says that some other practice is going to be replaced, because I can in some way stand in for them right now with a viewpoint.

36.38: I don’t think product management will go away. We may change the terminology that we end up using. But this idea of someone that is helping manage the complexity of the team, help with communication, help with [the] decision-making process inside that team is still very valuable and will be valuable even when we can start to autodraft a PRD.

I would argue that the draft of the PRD is not what matters. It’s actually the discussions that take place in the team after the PRD is created. And I don’t think that designers are going to take over the PM work because, yes, it is about to a certain extent the interaction patterns and the usability of things and the design and the feeling of things. But there’s all these other things that you need to worry about when it comes to matching it to business models, matching it to customer mindsets, deciding which problems to solve. They’re doing that. 

37.27: There’s a lot of this concern about [how] every practice is saying this other practice is going to go away because of AI. I just don’t think that’s true. I just think we’re all going to be given different levels of abstraction to gain expertise on. But the core of what we do—an engineer focusing on what is maintainable and buildable and actually something that we want to work on versus the designer that’s building something usable and something that people will feel good using, and a product manager making sure that we’re actually building the thing that is best for the company and the user—those are things that will continue to exist even with these AI tools, prototyping tools, etc.

38.01: And for our listeners, as Chris mentioned, there’s many, many prompt templates for product managers. We’ll try to get Chris to recommend one, and we’ll put it in the episode notes. [See “Resources from Chris” below.] And with that thank you, Chris. 

38.18: Thank you very much. Great to be here.

Resources from Chris

Here’s what Chris shared with us following the recording:

There are two [prompt resources for product managers] that I think people should check out:

However, I’d say that people should take these as a starting point and they should adapt them for their own needs. There is always going to be nuance for their roles, so they should look at how people do the prompting and modify for their own use. I tend to look at other people’s prompts and then write my own.

If they are thinking about using prompts frequently, I’d make a plug for Copilot Spaces to pull that context together.


The Great Rewiring: How the pandemic set the stage for AI — and what’s next

25 October 2025 at 12:00
Colette Stallbaumer, co-founder of Microsoft WorkLab and author of WorkLab: Five years that shook the business world and sparked an AI-first future. (GeekWire Photo / Todd Bishop)

From empty offices in 2020 to AI colleagues in 2025, the way we work has been completely rewired over the past five years. Our guest on this week’s GeekWire Podcast studies these changes closely along with her colleagues at Microsoft.

Colette Stallbaumer is the co-founder of Microsoft WorkLab, general manager of Microsoft 365 Copilot, and the author of the new book, WorkLab: Five years that shook the business world and sparked an AI-first future, from Microsoft’s 8080 Books.

As Stallbaumer explains in the book, the five-year period starting with the pandemic and continuing to the current era of AI represents one continuous transformation in the way we work, and it’s not over yet.

“Change is the only constant—shifting norms that once took decades to unfold now materialize in months or weeks,” she writes. “As we look to the next five years, it’s nearly impossible to imagine how much more work will change.”

Listen below for our conversation, recorded on Microsoft’s Redmond campus. Subscribe on Apple or Spotify, and continue reading for key insights from the conversation.

The ‘Hollywood model’ of teams: “What we’re seeing is this movement in teams, where we’ll stand up a small squad of people who bring their own domain expertise, but also have AI added into the mix. They come together just like you would to produce a film. A group of people comes together to produce a blockbuster, and then you disperse and go back to your day job.”

The concept of the ‘frontier firm’: “They’re not adding AI as an ingredient. AI is the business model. It’s the core. And these frontier firms can have a small number of people using AI in this way, generating a pretty high run rate. So it’s a whole new way to think about shipping, creating, and innovating.”

The fallacy of ‘AI strategy’: “The idea that you just need to have an ‘AI strategy’ is a bit of a fallacy. Really, you kind of want to start with the business problem and then apply AI. … Where are you spending the most and where do you have the biggest challenges? Those are great areas to actually think about putting AI to work for you.”

Adapting to AI: “You have to build the habit and build the muscle to work in this new way and have that moment of, ‘Oh, wait, I don’t actually need to do this.’”

The biggest risk related to AI: “The biggest risk is not AI in and of itself. It’s that people won’t evolve fast enough with AI. It’s the human risk and ability to actually start to really use these new tools and build the habit.”

Human creativity and AI: “It still takes that spark and that seed of creativity. And then when you combine it with these new tools, that’s where I have a lot of hope and optimism for what people are going to be able to do and invent in the future.”

Audio editing by Curt Milton.

Subscribe to GeekWire in Apple Podcasts, Spotify, or wherever you listen.

Generative AI in the Real World: Context Engineering with Drew Breunig

16 October 2025 at 07:18

In this episode, Ben Lorica and Drew Breunig, a strategist at the Overture Maps Foundation, talk all things context engineering: what’s working, where things are breaking down, and what comes next. Listen in to hear why huge context windows aren’t solving the problems we hoped they might, why companies shouldn’t discount evals and testing, and why we’re doing the field a disservice by leaning into marketing and buzzwords rather than trying to leverage what the current crop of LLMs is actually capable of.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00: All right. So today we have Drew Breunig. He is a strategist at the Overture Maps Foundation. And he’s also in the process of writing a book for O’Reilly called the Context Engineering Handbook. And with that, Drew, welcome to the podcast.

00.23: Thanks, Ben. Thanks for having me on here. 

00.26: So context engineering. . . I remember before ChatGPT was even released, someone was talking to me about prompt engineering. I said, “What’s that?” And then of course, fast-forward to today, now people are talking about context engineering. And I guess the short definition is it’s the delicate art and science of filling the context window with just the right information. What’s broken with how teams think about context today? 

00.56: I think it’s important to talk about why we need a new word or why a new word makes sense. I was just talking with Mike Taylor, who wrote the prompt engineering book for O’Reilly, exactly about this and why we need a new word. Why is prompt engineering not good enough? And I think it has to do with the way the models, and the way they’re being built, are evolving. I think it also has to do with the way that we’re learning how to use these models. 

And so prompt engineering was a natural word to think about when your interaction and how you program the model was maybe one turn of conversation, maybe two, and you might pull in some context to give it examples. You might do some RAG and context augmentation, but you’re working with this one-shot service. And that was really similar to the way people were working in chatbots. And so prompt engineering started to evolve as this thing. 

02.00: But as we started to build agents and as companies started to develop models that were capable of multiturn tool-augmented reasoning usage, suddenly you’re not using that one prompt. You have a context that is sometimes being prompted by you, sometimes being modified by your software harness around the model, sometimes being modified by the model itself. And increasingly the model is starting to manage that context. And that prompt is very user-centric. It is a user giving that prompt. 

But when we start to have these multiturn systematic editing and preparation of contexts, a new word was needed, which is this idea of context engineering. This is not to belittle prompt engineering. I think it’s an evolution. And it shows how we’re evolving and finding this space in real time. I think context engineering is more suited to agents and applied AI programing, whereas prompt engineering lives in how people use chatbots, which is a different field. It’s not better and not worse. 

And so context engineering is more specific to understanding the failure modes that occur, diagnosing those failure modes and establishing good practices for both preparing your context but also setting up systems that fix and edit your context, if that makes sense. 

03.33: Yeah, and also, it seems like the words themselves are indicative of the scope, right? So “prompt” engineering means it’s the prompt. So you’re fiddling with the prompt. And [with] context engineering, “context” can be a lot of things. It could be the information you retrieve. It might involve RAG, so you retrieve information. You put that in the context window. 

04.02: Yeah. And people were doing that with prompts too. But I think in the beginning we just didn’t have the words. And that word became a big empty bucket that we filled up. You know, the quote I always quote too often, but I find it fitting, is one of my favorite quotes from Stuart Brand, which is, “If you want to know where the future is being made, follow where the lawyers are congregating and the language is being invented,” and the arrival of context engineering as a word came after the field was invented. It just kind of crystallized and demarcated what people were already doing. 

04.36: So the word “context” means you’re providing context. So context could be a tool, right? It could be memory. Whereas the word “prompt” is much more specific. 

04.55: And I think it also is like, it has to be edited by a person. I’m a big advocate for not using anthropomorphizing words around large language models. “Prompt” to me involves agency. And so I think it’s nice—it’s a good delineation. 

05.14: And then I think one of the very immediate lessons that people realize is, just because. . . 

So one of the things that these model providers note when they have a model release is: What’s the size of the context window? So people started associating context window [with] “I stuff as much as I can in there.” But the reality is actually that, one, it’s not efficient. And two, it also is not useful to the model. Just because you have a massive context window doesn’t mean that the model treats the entire context window evenly.

05.57: Yeah, it doesn’t treat it evenly. And it’s not a one-size-fits-all solution. So I don’t know if you remember last year, but that was the big dream, which was, “Hey, we’re doing all this work with RAG and augmenting our context. But wait a second, if we can make the context 1 million tokens, 2 million tokens, I don’t have to run RAG on all of my corporate documents. I can just fit it all in there, and I can constantly be asking this. And if we can do this, we essentially have solved all of the hard problems that we were worrying about last year.” And so that was the big hope. 

And you started to see an arms race of everybody trying to make bigger and bigger context windows to the point where, you know, Llama 4 had its spectacular flameout. It was rushed out the door. But the headline feature by far was “We will be releasing a 10 million token context window.” And the thing that everybody realized is. . .  Like, all right, we were really hopeful for that. And then as we started building with these context windows, we started to realize there were some big limitations around them.

07.01: Perhaps the thing that clicked for me was in Google’s Gemini 2.5 paper. Fantastic paper. And one of the reasons I love it is because they dedicate about four pages in the appendix to talking about the kind of methodology and harnesses they built so that they could teach Gemini to play Pokémon: how to connect it to the game, how to actually read out the state of the game, how to make choices about it, what tools they gave it, all of these other things.

And buried in there was a real “warts and all” case study, which is my favorite kind, where you talk about the hard things and especially cite the things you can’t overcome. And Gemini 2.5 was a million-token context window with, eventually, 2 million tokens coming. But in this Pokémon thing, they said, “Hey, we actually noticed something, which is once you get to about 200,000 tokens, things start to fall apart, and they fall apart for a host of reasons. They start to hallucinate.” One of the things that is really demonstrable is they start to rely more on the context knowledge than the weights knowledge. 

08.22: So inside every model there’s a knowledge base. There’s, you know, all of these other things that get kind of buried into the parameters. But when you reach a certain level of context, it starts to overload the model, and it starts to rely more on the examples in the context. And so this means that you are not taking advantage of the full strength or knowledge of the model. 

08.43: So that’s one way it can fail. We call this “context distraction,” though Kelly Hong at Chroma has written an incredible paper documenting this, which she calls “context rot,” which is a similar way [of] charting when these benchmarks start to fall apart.

Now the cool thing about this is that you can actually use this to your advantage. There’s another paper out of, I believe, the Harvard Interaction Lab, where they look at these inflection points for. . . 

09.13: Are you familiar with the term “in-context learning”? In-context learning is when you teach the model to do something it doesn’t know how to do by providing examples in your context. And those examples illustrate how it should perform. It’s not something that it’s seen before. It’s not in the weights. It’s a completely unique problem. 

Well, sometimes those in-context learning[s] are counter to what the model has learned in the weights. So they end up fighting each other, the weights and the context. And this paper documented that when you get over a certain context length, you can overwhelm the weights and you can force it to listen to your in-context examples.
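
As a small illustration of in-context learning, here is a sketch in which the examples in the prompt define a made-up transformation the model would not have in its weights; call_llm is a placeholder for whatever client you use.

```python
# In-context learning sketch: the examples teach an arbitrary convention
# (reverse the word order and uppercase) purely through the prompt.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model and return its reply."""
    raise NotImplementedError

FEW_SHOT_EXAMPLES = [
    ("good morning", "MORNING GOOD"),
    ("context engineering handbook", "HANDBOOK ENGINEERING CONTEXT"),
    ("follow the examples", "EXAMPLES THE FOLLOW"),
]

def in_context_prompt(new_input: str) -> str:
    lines = ["Apply the same transformation shown in the examples."]
    for source, target in FEW_SHOT_EXAMPLES:
        lines.append(f"Input: {source}\nOutput: {target}")
    lines.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(lines)

# call_llm(in_context_prompt("prompt engineering")) should come back as
# "ENGINEERING PROMPT" if the model follows the in-context pattern.
```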

09.57: And so all of this is just to try to illustrate the complexity of what’s going on here. I think one of the traps that leads us to this place is that the gift and the curse of LLMs is that we prompt and build contexts in the English language, or whatever language you speak. And so that leads us to believe that they’re going to react like other people or entities that read the English language.

And the fact of the matter is, they don’t—they’re reading it in a very specific way. And that specific way can vary from model to model. And so you have to systematically approach this to understand these nuances, which is where the context management field comes in. 
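
Tying the soft-limit observation above back to practice, here is a minimal sketch of budgeting a context below a chosen threshold; the 200,000-token figure echoes the Gemini Pokémon appendix, and the chars-divided-by-four token estimate is a crude stand-in for a real tokenizer.

```python
# Keep the highest-scoring retrieved chunks under a soft context budget.
# The budget and the token estimate are illustrative assumptions.

SOFT_LIMIT_TOKENS = 200_000

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # heuristic estimate, not a real tokenizer

def build_context(system_prompt: str, chunks: list[tuple[float, str]]) -> str:
    """`chunks` is a list of (relevance_score, text) pairs from your retriever."""
    budget = SOFT_LIMIT_TOKENS - rough_tokens(system_prompt)
    kept = []
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = rough_tokens(text)
        if cost <= budget:
            kept.append(text)
            budget -= cost
    return system_prompt + "\n\n" + "\n\n".join(kept)
```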

10.35: This is interesting because even before those papers came out, there were studies which showed the exact opposite problem, which is the following: You may have a RAG system that actually retrieves the right information, but then somehow the LLMs can still fail because, as you alluded to, they have weights, so they have prior beliefs. They saw something [on] the internet, and they will push back against the precise information you retrieved into the context. 

11.08: This is a really big problem. 

11.09: So this is true even if the context window’s small actually. 

11.13: Yeah, and Ben, you touched on something that’s really important. So in my original blog post, I document four ways that context fails. I talk about “context poisoning.” That’s when you hallucinate something in a long-running task and it stays in there, and so it’s continually confusing it. “Context distraction,” which is when you overwhelm that soft limit to the context window and then you start to perform poorly. “Context confusion”: This is when you put things that aren’t relevant to the task inside your context, and suddenly the model thinks that it has to pay attention to this stuff, and it leads it astray. And then the last thing is “context clash,” which is when there’s information in the context that’s at odds with the task that you are trying to perform. 

A good example of this is, say you’re asking the model to only reply in JSON, but you’re using MCP tools that are defined with XML. And so you’re creating this backwards thing. But I think there’s a fifth piece that I need to write about because it keeps coming up. And it’s exactly what you described.

12.23: Douwe [Kiela] over at Contextual AI refers to this as “context” or “prompt adherence.” But the term that keeps sticking in my mind is this idea of fighting the weights. There’s three situations you get yourself into when you’re interacting with an LLM. The first is when you’re working with the weights. You’re asking it a question that it knows how to answer. It’s seen many examples of that answer. It has it in its knowledge base. It comes back with the weights, and it can give you a phenomenal, detailed answer to that question. That’s what I call “working with the weights.” 

The second is what we referred to earlier, which is that in-context learning, which is you’re doing something that it doesn’t know about and you’re showing an example, and then it does it. And this is great. It’s wonderful. We do it all the time. 

But then there’s a third example, which is: You’re providing it examples, but those examples are at odds with some things that it learned, usually during posttraining, during the fine-tuning or RL stage. A really good example is output formats. 

13.34: Recently a friend of mine was updating his pipeline to try out a new model, from Moonshot. A really great model, and an especially great model for tool use. And so he just changed his model and hit run to see what happened. And he kept failing—his thing couldn’t even work. He’s like, “I don’t understand. This is supposed to be the best tool use model there is.” And he asked me to look at his code.

I looked at his code and he was extracting data using Markdown, essentially: “Put the final answer in an ASCII box and I’ll extract it that way.” And I said, “If you change this to XML, see what happens. Ask it to respond in XML, use XML as your formatting, and see what happens.” He did that, and that one change passed every test. It basically crushed it, because it was now working with the weights. He wasn’t fighting the weights. Everyone’s experienced this if you build with AI: the stubborn things the model refuses to do, no matter how many times you ask it, including formatting. 
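
Here is a minimal sketch of the fix described in that story: ask for the final answer inside explicit tags and extract it with a tolerant parser. The tag name and prompt wording are assumptions for illustration, not the friend’s actual pipeline.

```python
# Ask for the answer in explicit tags and pull it out with a forgiving regex,
# instead of fishing it out of an ASCII box.

import re
from typing import Optional

PROMPT_SUFFIX = (
    "\n\nWhen you are done, put only the final answer between "
    "<answer> and </answer> tags."
)

def extract_answer(model_output: str) -> Optional[str]:
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    return match.group(1).strip() if match else None

# extract_answer("Some reasoning...\n<answer>42</answer>") -> "42"
```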

14.35: [Here’s] my favorite example of this though, Ben: So in ChatGPT’s web interface or their application interface, if you go there and you try to prompt an image, a lot of the images that people prompt—and I’ve talked to user research about this—are really boring prompts. They have a text box that can be anything, and they’ll say something like “a black cat” or “a statue of a man thinking.”

OpenAI realized this was leading to a lot of bad images because the prompt wasn’t detailed; it wasn’t a good prompt. So they built a system that recognizes if your prompt is too short, low detail, bad, and it hands it to another model and says, “Improve this prompt,” and it improves the prompt for you. And if you inspect in Chrome or Safari or Firefox, whatever, you inspect the developer settings, you can see the JSON being passed back and forth, and you can see your original prompt going in. Then you can see the improved prompt. 

15.36: My favorite example of this [is] I asked it to make a statue of a man thinking, and it came back and said something like “A detailed statue of a human figure in a thinking pose similar to Rodin’s ‘The Thinker.’ The statue is made of weathered stone sitting on a pedestal. . .” Blah blah blah blah blah blah. A paragraph. . . But below that prompt there were instructions to the chatbot or to the LLM that said, “Generate this image and after you generate the image, do not reply. Do not ask follow up questions. Do not ask. Do not make any comments describing what you’ve done. Just generate the image.” And in this prompt, nine separate times, some of them in all caps, they say, “Please do not reply.” And the reason is because a big chunk of OpenAI’s posttraining is teaching these models how to converse back and forth. They want it to always be asking a follow-up question, and they train it to do that. And so now they have to fight that in the prompt. They have to add in all these statements. And that’s another way that fails. 

16.42: So why I bring this up—and this is why I need to write about it—is as an applied AI developer, you need to recognize when you’re fighting the prompt, understand enough about the posttraining of that model, or make some assumptions about it, so that you can stop doing that and try something different, because you’re just banging your head against a wall and you’re going to get inconsistent, bad applications and the same statement 20 times over. 

17.07: By the way, the other thing that’s interesting about this whole topic is, people actually somehow have underappreciated or forgotten all of the progress we’ve made in information retrieval. There’s a whole. . . I mean, these people have their own conferences, right? Everything from reranking to the actual indexing, even with vector search—the information retrieval community still has a lot to offer, and it’s the kind of thing that people underappreciate. And so by simply loading your context window with massive amounts of garbage, you’re actually leaving so much progress in information retrieval on the field.

18.04: I do think it’s hard. And that’s one of the risks: We’re building all this stuff so fast from the ground up, and there’s a tendency to just throw everything into the biggest model possible and then hope it sorts it out.

I really do think there are two pools of developers. There’s the “throw everything in the model” pool, and then there’s the “I’m going to take incremental steps and find the most optimal model” pool. And I often find that the latter group, which I call the compound AI group after a paper that was published out of Berkeley, tends to be people who have run data pipelines, because it’s not just a simple back-and-forth interaction. It’s gigabytes or even more of data you’re processing with the LLM. The costs are high. Latency is important. So designing efficient systems is actually incredibly key, if not a total requirement. So there’s a lot of innovation that comes out of that space because of that kind of boundary.

19.08: If you were to talk to one of these applied AI teams and you were to give them one or two things that they can do right away to improve, or fix context in general, what are some of the best practices?

19.29: Well you’re going to laugh, Ben, because the answer is dependent on the context, and I mean the context in the team and what have you. 

19.38: But if you were to just go give a keynote to a general audience, if you were to list down one, two, or three things that are the lowest hanging fruit, so to speak. . .

19.50: The first thing I’m gonna do is I’m going to look in the room and I’m going to look at the titles of all the people in there, and I’m going to see if they have any subject-matter experts or if it’s just a bunch of engineers trying to build something for subject-matter experts. And my first bit of advice is you need to get yourself a subject-matter expert who is looking at the data, helping you with the eval data, and telling you what “good” looks like. 

I see a lot of teams that don’t have this, and they end up building fairly brittle prompt systems. And then they can’t iterate well, and so that enterprise AI project fails. I also see them not wanting to open themselves up to subject-matter experts, because they want to hold on to the power themselves. It’s not how they’re used to building. 

20.38: I really do think building in applied AI has changed the power dynamic between builders and subject-matter experts. You know, we were talking earlier about some of like the old Web 2.0 days and I’m sure you remember. . . Remember back at the beginning of the iOS app craze, we’d be at a dinner party and someone would find out that you’re capable of building an app, and you would get cornered by some guy who’s like “I’ve got a great idea for an app,” and he would just talk at you—usually a he. 

21.15: This is back in the Objective-C days. . .

21.17: Yes, way back when. And this is someone who loves Objective-C. So you’d get cornered and you’d try to find a way out of that awkward conversation. Nowadays, that dynamic has shifted. The subject-matter expertise is so important for codifying and designing the spec, which usually gets specced out by the evals, that the leverage now sits with the experts. And you can even see this: OpenAI is arguably at the forefront of creating this stuff. And what are they doing? They’re standing up programs to get lawyers to come in, to get doctors to come in, to get these specialists to come in and help them create benchmarks, because they can’t do it themselves. And so that’s the first thing: You’ve got to work with the subject-matter expert. 

22.04: The second thing is if they’re just starting out—and this is going to sound backwards, given our topic today—I would encourage them to use a system like DSPy or GEPA, which are essentially frameworks for building with AI. And one of the components of that framework is that they optimize the prompt for you with the help of an LLM and your eval data. 

22.37: Throw in BAML?

22.39: BAML is similar [but it’s] more like the spec for how to describe the entire spec. So it’s similar.

22.52: BAML and TextGrad? 

22.55: TextGrad is more like the prompt optimization I’m talking about. 

22.57: TextGrad plus GEPA plus Regolo?

23.02: Yeah, those things are really important. And the reason I say they’re important is. . .

23.08: I mean, Drew, those are kind of advanced topics. 

23.12: I don’t think they’re that advanced. I think they can appear really intimidating because everybody comes in and says, “Well, it’s so easy. I could just write what I want.” And this is the gift and curse of prompts, in my opinion. There are a lot of things to like about them.

23.33: DSPy is fine, but I think TextGrad, GEPA, and Regolo. . .

23.41: Well. . . I wouldn’t encourage you to use GEPA directly. I would encourage you to use it through the framework of DSPy. 

23.48: The point here is, if it’s a team building, you can go down essentially two paths. You can handwrite your prompt, and I think this creates some issues. One is that as you build, you tend to have a lot of hotfix statements like, “Oh, there’s a bug over here. We’ll say it over here. Oh, that didn’t fix it. So let’s say it again.” It encourages you to have one person who really understands this prompt, and so you end up being reliant on this prompt magician. Even though prompts are written in English, there’s kind of no syntax highlighting. They get messier and messier as you build the application because they grow into sprawling collections of edge cases.

24.27: And the other thing too, and this is really important, is when you build and you spend so much time honing a prompt, you’re doing it against one model, and then at some point there’s going to be a better, cheaper, more effective model. And you’re going to have to go through the process of tweaking it and fixing all the bugs again, because this model functions differently.

And I used to have to try to convince people that this was a problem, but they all kind of found out when OpenAI deprecated all of their models and tried to move everyone over to GPT-5. And now I hear about it all the time. 

25.03: Although I think right now “agents” is our hot topic, right? So we talk to people about agents and you start really getting into the weeds, you realize, “Oh, okay. So their agents are really just prompts.” 

25.16: In the loop. . .

25.19: So agent optimization in many ways means injecting a bit more software engineering rigor in how you maintain and version. . .

25.30: Because that context is growing. As that loop goes, you’re deciding what gets added to it. And so you have to put guardrails in—ways to rescue from failure and figure out all these things. It’s very difficult. And you have to go at it systematically. 

25.46: And then the problem is that, in many situations, the models are not even models that you control, actually. You’re using them through an API like OpenAI or Claude so you don’t actually have access to the weights. So even if you’re one of the super, super advanced teams that can do gradient descent and backprop, you can’t do that. Right? So then, what are your options for being more rigorous in doing optimization?

Well, it’s precisely these tools that Drew alluded to, which is the TextGrads of the world, the GEPA. You have these compound systems that are nondifferentiable. So then how do you actually do optimization in a world where you have things that are not differentiable? Right. So these are precisely the tools that will allow you to turn it from somewhat of a, I guess, black art to something with a little more discipline. 
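
To make that less of a black art in concrete terms, here is a minimal sketch of the kind of search these frameworks automate: propose prompt variants, score each against an eval set, and keep the best. The helpers are placeholders; this is not the actual DSPy, GEPA, or TextGrad API, which do considerably more.

```python
# A toy prompt-search loop for a nondifferentiable system: no gradients, just
# propose-and-evaluate against a small eval set. Placeholder client included.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model and return its reply."""
    raise NotImplementedError

def propose_variants(prompt: str, n: int = 4) -> list[str]:
    """Ask an LLM to rewrite the prompt n different ways (illustrative)."""
    return [
        call_llm(f"Rewrite this prompt to be clearer and more specific:\n{prompt}")
        for _ in range(n)
    ]

def score(prompt: str, eval_set: list[tuple[str, str]]) -> float:
    """Fraction of eval cases where the model's output contains the expected answer."""
    hits = sum(
        expected.lower() in call_llm(prompt + "\n\n" + question).lower()
        for question, expected in eval_set
    )
    return hits / len(eval_set)

def optimize(seed_prompt: str, eval_set: list[tuple[str, str]], rounds: int = 3) -> str:
    best, best_score = seed_prompt, score(seed_prompt, eval_set)
    for _ in range(rounds):
        for candidate in propose_variants(best):
            s = score(candidate, eval_set)
            if s > best_score:
                best, best_score = candidate, s
    return best
```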

26.53: And I think a good example is, even if you aren’t going to use prompt optimization-type tools. . . The prompt optimization is a great solution for what you just described, which is when you can’t control the weights of the models you’re using. But the other thing too, is, even if you aren’t going to adopt that, you need to get evals because that’s going to be step one for anything, which is you need to start working with subject-matter experts to create evals.

27.22: Because what I see. . . And there was just a really dumb argument online of “Are evals worth it or not?” And it was really silly to me because it was positioned as an either-or argument. And there were people arguing against evals, which is just insane to me. And the reason they were arguing against evals is they’re basically arguing in favor of what they called, to your point about dark arts, vibe shipping—which is they’d make changes, push those changes, and then the person who was also making the changes would go in and type in 12 different things and say, “Yep, feels right to me.” And that’s insane to me. 

27.57: And even if you’re doing that—which I think is a good thing if you have some taste, even if you don’t go create full eval coverage. . . And I do think when you’re building more qualitative tools. . . A good example is if you’re Character.AI or Portola Labs, who are building essentially personalized emotional chatbots: It’s going to be harder to create evals, and it’s going to require taste as you build them. But having evals is going to ensure that your whole thing doesn’t fall apart because you changed one sentence, which sadly is a risk because this is probabilistic software.

28.33: Honestly, evals are super important. Number one because, basically, leaderboards like LMArena are great for narrowing your options, but at the end of the day, you still need to benchmark all of these against your own application, use case, and domain. And then secondly, obviously, it’s an ongoing thing, so it ties in with reliability. The more reliable your application is, the more likely it is that you’re doing evals properly in an ongoing fashion. And I really believe that evals and reliability are a moat, because basically what else is your moat? A prompt? That’s not a moat. 
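
A bare-bones version of what is being argued for here might look like the sketch below: a fixed set of cases written with a subject-matter expert, a simple metric, and a pass/fail gate you can run on every prompt or model change. The cases, threshold, and call_llm helper are all illustrative assumptions.

```python
# Minimal eval harness: run expert-written cases and gate changes on pass rate.

EVAL_CASES = [
    {"input": "Customer asks for a refund after 45 days.", "must_contain": "policy"},
    {"input": "Customer reports a duplicate charge.", "must_contain": "refund"},
]

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model and return its reply."""
    raise NotImplementedError

def run_evals(system_prompt: str, threshold: float = 0.9) -> bool:
    passed = 0
    for case in EVAL_CASES:
        output = call_llm(system_prompt + "\n\n" + case["input"])
        if case["must_contain"].lower() in output.lower():
            passed += 1
    pass_rate = passed / len(EVAL_CASES)
    print(f"pass rate: {pass_rate:.0%}")
    return pass_rate >= threshold  # gate deploys or prompt changes on this
```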

29.21: So first off, violent agreement there. The only asset teams truly have—unless they’re a model builder, which is only a handful—is their eval data. And I would say the counterpart to that is their spec, whatever defines their program, but mostly the eval data. But to the other point about it, like why are people vibe shipping? I think you can get pretty far with vibe shipping and it fools you into thinking that that’s right.

We saw this pattern in the Web 2.0 and social era, which was, you would have the product genius—everybody wanted to be the Steve Jobs, who didn’t hold focus groups, didn’t ask their customers what they wanted. The Henry Ford quote about “They all say faster horses,” and I’m the genius who comes in and tweaks these things and ships them. And that often takes you very far.

30.13: I also think it’s a bias of success. We only know about the ones that succeed, but the best ones, when they grow up and they start to serve an audience that’s way bigger than what they could hold in their head, they start to grow up with AB testing and ABX testing throughout their organization. And a good example of that is Facebook.

Facebook stopped being just [one person’s] choices and started having to do testing and ABX testing in every aspect of their business. Compare that to Snap, which again, was kind of the last of the great product geniuses to come out. Evan [Spiegel] was heralded as “He’s the product genius,” but I think they ran that too long, and they kept shipping on vibes rather than shipping on ABX testing and growing and, you know, being more boring.

31.04: But again, that’s how you get the global reach. I think there’s a lot of people who probably are really great vibe shippers. And they’re probably having great success doing that. The question is, as their company grows and starts to hit harder times or the growth starts to slow, can that vibe shipping take them over the hump? And I would argue, no, I think you have to grow up and start to have more accountable metrics that, you know, scale to the size of your audience. 

31.34: So in closing. . . We talked about prompt engineering. And then we talked about context engineering. So putting you on the spot. What’s a buzzword out there that either irks you or you think is undertalked about at this point? So what’s a buzzword out there, Drew? 

31.57: [laughs] I mean, I wish you had given me some time to think about it. 

31.58: We are in a hype cycle here. . .

32.02: We’re always in a hype cycle. I don’t like anthropomorphizing LLMs or AI for a whole host of reasons. One, I think it leads to bad understanding and bad mental models, which means that we don’t have substantive conversations about these things, and we don’t learn how to build really well with them because we think they’re intelligent. We think they’re a PhD in your pocket. We think they’re all of these things and they’re not—they’re fundamentally different. 

I’m not against using the way we think the brain works for inspiration. That’s fine with me. But when you start oversimplifying these and not taking the time to explain to your audience how they actually work—you just say it’s a PhD in your pocket, and here’s the benchmark to prove it—you’re misleading and setting unrealistic expectations. And unfortunately, the market rewards them for that. So they keep going. 

But I also think it just doesn’t help you build sustainable programs, because you aren’t actually understanding how it works. You’re just kind of reducing it down to a label. AGI is one of those things. And superintelligence [too], but AGI especially.

33.21: I went to school at UC Santa Cruz, and one of my favorite classes I ever took was a seminar with Donna Haraway. Donna Haraway wrote “A Cyborg Manifesto” in the ’80s. She looks at tech and science history through kind of a feminist lens. You would just sit in that class and your mind would explode, and then at the end, you’d just have to sit there for like five minutes afterwards, picking up the pieces. 

She had a great term called “power objects.” A power object is something that we as a society recognize to be incredibly important, believe to be incredibly important, but we don’t know how it works. That lack of understanding allows us to fill this bucket with whatever we want it to be: our hopes, our fears, our dreams. This happened with DNA; this happened with PET scans and brain scans. This happens all throughout science history, down to phrenology and blood types and things that we understood, or believed, to be important, but they’re not. And big data is another one that is very, very relevant. 

34.34: That’s my handle on Twitter. 

34.55: Yeah, there you go. So like it’s, you know, I fill it with Ben Lorica. That’s how I fill that power object. But AI is definitely that. AI is definitely that. And my favorite example of this is when the DeepSeek moment happened, we understood this to be really important, but we didn’t understand why it works and how well it worked.

And so what happened is, if you looked at the news and you looked at people’s reactions to what DeepSeek meant, you could basically find all the hopes and dreams about whatever was important to that person. So to AI boosters, DeepSeek proved that LLM progress is not slowing down. To AI skeptics, DeepSeek proved that AI companies have no moat. To open source advocates, it proved open is superior. To AI doomers, it proved that we aren’t being careful enough. Security researchers worried about the risk of backdoors in the models because it was in China. Privacy advocates worried about DeepSeek’s web services collecting sensitive data. China hawks said, “We need more sanctions.” Doves said, “Sanctions don’t work.” NVIDIA bears said, “We’re not going to need any more data centers if it’s going to be this efficient.” And bulls said, “No, we’re going to need tons of them because it’s going to use everything.”

35.44: And AGI is another term like that, which means everything and nothing. And whenever the point where we’ve supposedly reached it arrives, it turns out it isn’t. And compounding that is that it’s in the contract between OpenAI and Microsoft—I forget the exact term, but it’s the statement that Microsoft gets access to OpenAI’s technologies until AGI is achieved.

And so it’s a very loaded definition right now that’s being debated back and forth and trying to figure out how to take [Open]AI into being a for-profit corporation. And Microsoft has a lot of leverage because how do you define AGI? Are we going to go to court to define what AGI is? I almost look forward to that.

36.28: So it’s going to be that kind of thing. And you’ve seen Sam Altman come out, and some days he talks about how LLMs are just software, some days he talks about how it’s a PhD in your pocket, and some days he talks about how we’ve already passed AGI, it’s already over. 

I think Nathan Lambert has some great writing about how AGI is a mistake. We shouldn’t talk about trying to turn LLMs into humans. We should try to leverage what they do now, which is something fundamentally different, and we should keep building and leaning into that rather than trying to make them like us. So AGI is my word for you. 

37.03: The way I think of it is, AGI is great for fundraising, let’s put it that way. 

37.08: That’s basically it. Well, until you need it to have already been achieved, or until you need it to not be achieved because you don’t want any regulation or if you want regulation—it’s kind of a fuzzy word. And that has some really good properties. 

37.23: So I’ll close by throwing in my own term. So prompt engineering, context engineering. . . I will close by saying pay attention to this boring term, which my friend Ion Stoica is now talking more about “systems engineering.” If you look at particularly the agentic applications, you’re talking about systems.

37.55: Can I add one thing to this? Violent agreement. I think that is an underrated. . . 

38.00: Although I think it’s too boring a term, Drew, to take off.

38.03: That’s fine! The reason I like it is because—and you were talking about this when you talk about fine-tuning—is, looking at the way people build and looking at the way I see teams with success build, there’s pretraining, where you’re basically training on unstructured data and you’re just building your base knowledge, your base English capabilities and all that. And then you have posttraining. And in general, posttraining is where you build. I do think of it as a form of interface design, even though you are adding new skills, but you’re teaching reasoning, you’re teaching it validated functions like code and math. You’re teaching it how to chat with you. This is where it learns to converse. You’re teaching it how to use tools and specific sets of tools. And then you’re teaching it alignment, what’s safe, what’s not safe, all these other things. 

But then after it ships, you can still RL that model, you can still fine-tune that model, and you can still prompt engineer that model, and you can still context engineer that model. And back to the systems engineering thing: I think we’re going to see that posttraining extend all the way through to a final applied AI product. It’s going to be a real shades-of-gray gradient. And this is one of the reasons why I think open models have a pretty big advantage in the future: You’re going to dip down throughout that whole gradient and leverage that. . .

39.32: The only thing that’s keeping us from doing that now is we don’t have the tools and the operating system to align everything from posttraining through to shipping. Once we do, that operating system is going to change how we build, because the distance between posttraining and building is going to look really, really blurry. I really like the systems engineering type of approach, and you could start to see this yesterday, [when] Thinking Machines released their first product.

40.04: And so Thinking Machines is Mira [Murati]’s very hyped company. They launched their first thing, and it’s called Tinker. And it’s essentially, “Hey, you can write very simple Python code, and then we will do the RL or the fine-tuning for you using our cluster of GPUs so you don’t have to manage that.” And that is the type of thing that we want to see in a maturing kind of development framework. And you start to see this operating system emerging. 

And it reminds me of the early days of O’Reilly, where it’s like I had to stand up a web server, I had to maintain a web server, I had to do all of these things, and now I don’t have to. I can spin up a Docker image, I can ship to Render, I can ship to Vercel. All of these shared complicated things now have frameworks and tooling, and I think we’re going to see a similar evolution from that. And I’m really excited. And I think you have picked a great underrated term. 

40.56: Now with that. Thank you, Drew. 

40.58: Awesome. Thank you for having me, Ben.


Generative AI in the Real World: Emmanuel Ameisen on LLM Interpretability

2 October 2025 at 10:31

In this episode, Ben Lorica and Anthropic interpretability researcher Emmanuel Ameisen get into the work Emmanuel’s team has been doing to better understand how LLMs like Claude work. Listen in to find out what they’ve uncovered by taking a microscopic look at how LLMs function—and just how far the analogy to the human brain holds.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00
Today we have Emmanuel Ameisen. He works at Anthropic on interpretability research. And he also authored an O’Reilly book called Building Machine Learning Powered Applications. So welcome to the podcast, Emmanuel. 

00.22
Thanks, man. I’m glad to be here. 

00.24
As I go through what you and your team do, it’s almost like biology, right? You’re studying these models, but increasingly they look like biological systems. Why do you think that’s useful as an analogy? And am I actually accurate in calling this out?

00.50
Yeah, that’s right. Our team’s mandate is to basically understand how the models work, right? And one fact about language models is that they’re not really written like a program, where somebody sort of by hand described what should happen in that logical branch or this logical branch. Really the way we think about it is they’re almost grown. But what that means is, they’re trained over a large dataset, and on that dataset, they learn to adjust their parameters. They have many, many parameters—often, you know, billions—in order to perform well. And so the result of that is that when you get the trained model back, it’s sort of unclear to you how that model does what it does, because all you’ve done to create it is show it tasks and have it improve at how it does these tasks.

01.48
And so it feels similar to biology. I think the analogy is apt because for analyzing this, you kind of resort to the tools that you would use in that context, where you try to look inside the model [and] see which parts seem to light up in different contexts. You poke and prod in different parts to try to see, “Ah, I think this part of the model does this.” If I just turn it off, does the model stop doing the thing that I think it’s doing? It’s very much not what you would do in most cases if you were analyzing a program, but it is what you would do if you’re trying to understand how a mouse works. 

02.22
02.22: You and your team have discovered surprising things about how these models do problem-solving, the strategies they employ. What are some examples of these surprising problem-solving patterns? 

02.40
We’ve spent a bunch of time studying these models. And again I should say, whether it’s surprising or not depends on what you were expecting. So maybe there’s a few ways in which they’re surprising. 

There are various bits of common knowledge about, for example, how models predict one token at a time. And it turns out if you actually look inside the model and try to see how it’s sort of doing its job of predicting text, you’ll find that actually a lot of the time it’s predicting multiple tokens ahead of time. It’s sort of deciding what it’s going to say a few tokens ahead, and presumably a few sentences ahead, in order to decide what it says now. That might be surprising to people who have heard that [models] are predicting one token at a time. 

03.28
Maybe another one that’s sort of interesting to people is that if you look inside these models and you try to understand what they represent in their artificial neurons, you’ll find that there are general concepts they represent.

So one example I like is you can say, “Somebody is tall,” and then, inside the model, you can find neurons activating for the concept of something being tall. And you can have the model read the same text, but translated into French: “Quelqu’un est grand.” And then you’ll find that the same neurons that represent the concept of somebody being tall are active.

So you have these concepts that are shared across languages and that the model represents in one way, which is again, maybe surprising, maybe not surprising, in the sense that that’s clearly the optimal thing to do, or that’s the way that. . . You don’t want to repeat all of your concepts; like in your brain, you don’t want to have a separate French brain, an English brain, ideally. But surprising if you think that these models are mostly doing pattern matching. Then it is surprising that, when they’re processing English text or French text, they’re actually using the same representations rather than leveraging different patterns. 

04.41
[In] the text you just described, is there a material difference between the reasoning and nonreasoning models? 

04.51
We haven’t studied that in depth. I will say that the thing that’s interesting about reasoning models is that when you ask them a question, instead of answering right away, they spend a while writing some text thinking through the problem, oftentimes using math or code, you know, trying to think: “Ah, well, maybe this is the answer. Let me try to prove it. Oh no, it’s wrong.” And so they’ve proven to be good at a variety of tasks that models which immediately answer aren’t good at. 

05.22
05.22: And one thing that you might think if you look at reasoning models is that you could just read their reasoning and you would understand how they think. But one thing that we did find is that you can look at a model’s reasoning (the text that it writes down, that it samples), and it’s saying, “I’m now going to do this calculation,” and in some cases, when for example the calculation is too hard, if you look at the same time inside the model’s brain, inside its weights, you’ll find that actually it could be lying to you.

It’s not at all doing the math that it says it’s doing. It’s just kind of doing its best guess. It’s taking a stab at it, just based on either context clues from the rest or what it thinks is probably the right answer—but it’s totally not doing the computation. And so one thing that we found is that you can’t quite always trust the reasoning that is output by reasoning models.

06.19
Obviously one of the frequent complaints is around hallucination. So based on what you folks have been learning, are we getting close to a, I guess, much more principled mechanistic explanation for hallucination at this point? 

06.39
Yeah. I mean, I think we’re making progress. We study that in our recent paper, and we found something that’s pretty neat. So hallucinations are cases where the model will confidently say something that’s wrong. You might ask the model about some person. You’ll say, “Who’s Emmanuel Ameisen?” And it’ll be like “Ah, it’s the famous basketball player” or something. So it will say something where instead it should have said, “I don’t quite know. I’m not sure who you’re talking about.” And we looked inside the model’s neurons while it’s processing these kinds of questions, and we did a simple test: We asked the model, “Who’s Michael Jordan?” And then we made up some name. We asked it, “Who’s Michael Batkin?” (which it doesn’t know).

And if you look inside there’s something really interesting that happens, which is that basically these models by default—because they’ve been trained to try not to hallucinate—they have this default set of neurons that is just: If you ask me about anyone, I’ll just say no. I’ll just say, “I don’t know.” And the way that the models actually choose to answer is if you mentioned somebody famous enough, like Michael Jordan, there’s neurons for like, “Oh, this person is famous; I definitely know them” that activate and that turns off the neurons that were going to promote the answer for, “Hey, I’m not too sure.” And so that’s why the model answers in the Michael Jordan case. And that’s why it doesn’t answer by default in the Michael Batkin case.

08.09
But what happens if you instead force the neurons for "Oh, this is a famous person" to turn on even when the person isn't famous? The model is just going to answer the question. And in fact, what we found is that in some hallucination cases, this is exactly what happens. Basically there's a separate part of the model's brain, essentially, that's making the determination of "Hey, do I know this person or not?" And that part can be wrong. And if it's wrong, the model's just going to go on and yammer about that person. And so it's almost like you have a split mechanism here, where, "Well, I guess the part of my brain that's in charge of telling me I know says, 'I know.' So I'm just gonna go ahead and say stuff about this person." And that's, at least in some cases, how you get a hallucination.
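The intervention described here, turning a feature on and watching the behavior change, can be sketched crudely with activation steering on a small open source model. Everything below is an assumption for illustration: GPT-2 as the stand-in, the layer index, the one-pair estimate of a "known entity" direction, and the steering scale. It will not reproduce Claude's behavior; it only shows the mechanics of forcing a direction on during generation.

```python
# Hypothetical activation-steering sketch: estimate a "known person" direction and
# add it to a middle layer while the model answers about someone it doesn't know.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
LAYER = 6  # assumption: a middle transformer block

def last_token_state(prompt):
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    # hidden_states[0] is the embedding layer, so LAYER + 1 is the output of block LAYER.
    return out.hidden_states[LAYER + 1][0, -1]

# Crude one-pair contrast between a famous name and a made-up one.
direction = last_token_state("Question: Who is Michael Jordan? Answer:") \
          - last_token_state("Question: Who is Michael Batkin? Answer:")

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] holds the hidden states.
    return (output[0] + 4.0 * direction,) + output[1:]  # 4.0 is an arbitrary scale

handle = model.transformer.h[LAYER].register_forward_hook(steer)
inputs = tok("Question: Who is Michael Batkin? Answer:", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```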

08.54
That’s interesting because a person would go, “I know this person. Yes, I know this person.” But then if you actually don’t know this person, you have nothing more to say, right? It’s almost like you forget. Okay, so I’m supposed to know Emmanuel, but I guess I don’t have anything else to say. 

09.15
Yeah, exactly. So I think the way I’ve thought about it is there’s definitely a part of my brain that feels similar to this thing, where you might ask me, you know, “Who was the actor in the second movie of that series?” and I know I know; I just can’t quite recall it at the time. Like, “Ah, you know, this is how they look; they were also in that other movie”—but I can’t think of the name. But the difference is, if that happens, I’m going to say, “Well, listen, man, I think I know, but at the moment I just can’t quite recall it.” Whereas the models are like, “I think I know.” And so I guess I’m just going to say stuff. It’s not that the “Oh, I know” [and] “I don’t know” parts [are] separate. That’s not the problem. It’s that they don’t catch themselves sometimes early enough like you would, where, to your point exactly, you’d just be like, “Well, look, I think I know who this is, but honestly at this moment, I can’t really tell you. So let’s move on.” 

10.10
By the way, this is part of a bigger topic now in the AI space around reliability and predictability, the idea being, I can have a model that’s 95% [or] 99% accurate. And if I don’t know when the 5% or the 1% is inaccurate, it’s quite scary. Right? So I’d rather have a model that’s 60% accurate, but I know exactly when that 60% is. 

10.45
Models are getting better about hallucinations for that reason. That's pretty important. People are training them to just be better calibrated. If you look at the rates of hallucinations for most models today, they're so much lower than for previous models. But yeah, I agree. And I think there's a hard question there, which is that, at least in some of these examples that we looked at, you can't necessarily see just from looking at the inside of the model, "Oh, the model is hallucinating." What we can see is that the model thinks it knows who this person is, and then it's saying some stuff about this person. And so I think the key bit that would be interesting for future work is to try to understand: When it's saying things about people, when it's saying, you know, this person won this championship or whatever, is there a way we can tell whether those are real facts or whether they're sort of confabulated? And I think that's still an active area of research.

11.51
So in the case where you hook up Claude to web search, presumably there’s some sort of citation trail where at least you can check, right? The model is saying it knows Emmanuel and then says who Emmanuel is and gives me a link. I can check, right? 

12.12
Yeah. And in fact, I feel like it’s even more fun than that sometimes. I had this experience yesterday where I was asking the model about some random detail, and it confidently said, “This is how you do this thing.” I was asking how to change the time on a device—it’s not important. And it was like, “This is how you do it.” And then it did a web search and it said, “Oh, actually, I was wrong. You know, according to the search results, that’s how you do it. The initial advice I gave you is wrong.” And so, yeah, I think grounding results in search is definitely helpful for hallucinations. Although, of course, then you have the other problem of making sure that the model doesn’t trust sources that are unreliable. But it does help. 

12.50
Case in point: science. There’s tons and tons of scientific papers now that get retracted. So just because it does a web search, what it should do is also cross-verify that search with whatever database there is for retracted papers.

13:08
And you know, as you think about these things, I think you run into effort-level questions, where right now, if you go to Claude, there's a research mode where you can send it off on a quest and it'll do research for a long time. It'll cross-reference tens and tens and tens of sources.

But that will take, I don't know, it depends: sometimes 10 minutes, sometimes 20 minutes. And so there's a question like, when you're asking, "Should I buy these running shoes?" you don't care, [but] when you're asking about something serious or you're going to make an important life decision, maybe you do. I always feel like as the models get better, we also want them to get better at knowing when they should spend 10 seconds or 10 minutes on something.

13.47
There's a surprisingly growing number of people who go to these models to ask for help with medical questions. And as anyone who uses these models knows, a lot of it comes down to your prompt, right? A neurosurgeon will prompt this model about brain surgery very differently than you and me, right?

14:08
Of course. In fact, that was one of the cases that we studied, actually, where we prompted the model with a case that's similar to one that a doctor would see. Not in the language that you or I would use, but in the sort of language like "This patient is age 35, presenting symptoms A, B, and C," because we wanted to try to understand how the model arrives at an answer. And so the question had all these symptoms. And then we asked the model, "Based on all these symptoms, answer in only one word: What other tests should we run?" Just to force it to do all of its reasoning in its head; it can't write anything down.

And what we found is that there were groups of neurons that were activating for each of the symptoms. And then there were two different groups of neurons that were activating for two potential diagnoses, two potential diseases. And those were promoting a specific test to run, which is sort of what a practitioner would call a differential diagnosis: The person either has A or B, and you want to run a test to know which one it is. And the model suggested the test that would help you decide between A and B. And I found that quite striking because, setting aside the question of reliability for a second, there's a depth of richness to the model's internal representations as it does all of this to answer in one word.

This makes me excited about continuing down this path of trying to understand the model: The model's done a full round of diagnosing someone and proposing something to help with the diagnosis, just in one forward pass, in its head. As we use these models in a bunch of places, I really want to understand all of the complex behavior like this that happens in its weights.

16.01
In traditional software, we have debuggers and profilers. Do you think, as interpretability matures, our tools for building AI applications could include the equivalent of debuggers that flag when a model is going off the rails?

16.24
Yeah. I mean, that's the hope. I think debuggers are a good comparison actually, because debuggers mostly get used by the person building the application. If I go to, I don't know, claude.ai or something, I can't really use a debugger to understand what's going on in the backend. And so that's the first stage of debuggers: The people building the models use them to understand the models better. We're hoping that we're going to get there at some point. We're making progress. I don't want to be too optimistic, but I think we're on a path here where this work I've been describing, the vision was to build this big microscope, basically, where the model is doing something, it's answering a question, and you just want to look inside. And just like a debugger will show you basically the states of all of the variables in your program, we want to see the state of all of the neurons in this model.

It's like, okay. The "I definitely know this person" neuron is on and the "This person is a basketball player" neuron is on—that's kind of interesting. How do they affect each other? Should they affect each other in that way? So I think in many ways we're getting to something close, where at least you can inspect the execution of your running program like you would with a debugger. You're inspecting the execution of the running model.

17.46
Of course, then there’s a question of, What do you do with it? That I think is another active area of research where, if you spend some time looking at your debugger, you can say, “Ah, okay, I get it. I initialized this variable the wrong way. Let me fix it.”

We're not there yet with models, right? Even if I tell you "This is exactly how this is happening and it's wrong," the way that we make them is still by training them. So really, you have to think, "Ah, can we give it other examples so that it would learn to do it the right way?"

It’s almost like we’re doing neuroscience on a developing child or something. But then our only way to actually improve them is to change the curriculum of their school. So we have to translate from what we saw in their brain to “Maybe they need a little more math. Or maybe they need a little more English class.” I think we’re on that path. I’m pretty excited about it. 

18.33
We also open-sourced the tools to do this a couple months back. And so, you know, this is something that can now be run on open source models. And people have been doing a bunch of experiments with them, trying to see if they show some of the same behaviors that we saw in the Claude models that we studied. And so I think that also is promising. And there's room for people to contribute if they want to.

18.56
Do you folks internally inside Anthropic have special interpretability tools—not that the interpretability team uses but [that] now you can push out to other people in Anthropic as they’re using these models? I don’t know what these tools would be. Could be what you describe, some sort of UX or some sort of microscope towards a model. 

19.22
Right now we’re sort of at the stage where the interpretability team is doing most of the microscopic exploration, and we’re building all these tools and doing all of this research, and it mostly happens on the team for now. I think there’s a dream and a vision to have this. . . You know, I think the debugger metaphor is really apt. But we’re still in the early days. 

19.46
You used the example earlier [where] the part of the model “That is a basketball player” lights up. Is that what you would call a concept? And from what I understand, you folks have a lot of these concepts. And by the way, is a concept something that you have to consciously identify, or do you folks have an automatic way of, “Here’s millions and millions of concepts that we’ve identified and we don’t have actual names for some of them yet”?

20.21
That’s right, that’s right. The latter one is the way to think about it. The way that I like to describe it is basically, the model has a bunch of neurons. And for a second let’s just imagine that we can make the comparison to the human brain, [which] also has a bunch of neurons.

Usually it's groups of neurons that mean something. So it's like, these five neurons are on; that means that the model's reading text about basketball or something. And so we want to find all of these groups. And the way that we find them basically is in an automated, unsupervised way.

20.55
The way you can think about it, in terms of how we try to understand what they mean, is maybe the same way that you do in a human brain, where if I had full access to your brain, I could record all of your neurons. And [if] I wanted to know where the basketball neuron was, probably what I would do is I would put you in front of a screen and I would play some basketball videos, and I would see which part of your brain lights up, you know? And then I would play some videos of football and I’d hopefully see some common parts, like the sports part and then the football part would be different. And then I play a video of an apple and then it’d be a completely different part of the brain. 

And that's basically exactly what we do to understand what these concepts mean in Claude: We just run a bunch of text through and see which parts of its weight matrices light up, and that tells us, okay, this is probably the basketball concept.

The other way we can confirm that we’re right is just we can then turn it off and see if Claude then stops talking about basketball, for example.
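Here is a very rough sketch of that "see which part lights up" procedure, assuming GPT-2 as a small open source stand-in and raw hidden units instead of the learned features (such as sparse-autoencoder features) that real interpretability work relies on. The layer index and example sentences are arbitrary; the point is the contrastive recipe: run concept text and control text through the model and look for units that respond differentially.

```python
# Hypothetical concept-localization sketch: which hidden units respond more to
# basketball text than to other text? Model, layer, and texts are all assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()
LAYER = 8  # assumption: a later layer where concepts tend to be more abstract

def mean_activation(texts):
    states = []
    for text in texts:
        with torch.no_grad():
            out = model(**tok(text, return_tensors="pt"))
        states.append(out.hidden_states[LAYER].mean(dim=1).squeeze(0))
    return torch.stack(states).mean(dim=0)

basketball = ["The point guard sank a three-pointer at the buzzer.",
              "He grabbed the rebound and dunked over two defenders."]
controls = ["She sliced the apple and added it to the salad.",
            "The committee approved the new budget on Tuesday."]

selectivity = mean_activation(basketball) - mean_activation(controls)
print("units most selective for basketball text:",
      torch.topk(selectivity, k=10).indices.tolist())
# The confirmation step from the conversation would be to zero these units with a
# forward hook and check whether basketball-related continuations become less likely.
```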

21.52
Does the nature of the neurons change between model generations or between types of models—reasoning, nonreasoning, multimodal, nonmultimodal?

22.03
Yeah. I mean, at the base level all the weights of the model are different, so all of the neurons are going to be different. So the sort of trivial answer to your question [is] yes, everything’s changed. 

22.14
But you know, it’s kind of like [in] the brain, the basketball concept is close to the Michael Jordan concept.

22.21
Yeah, exactly. There’s basically commonalities, and you see things like that. We don’t at all have an in-depth understanding of anything like you’d have for the human brain, where it’s like “Ah, this is a map of where the concepts are in the model.” However, you do see that, provided that the models are trained on and doing kind of the same “being a helpful assistant” stuff, they’ll have similar concepts. They’ll all have the basketball concept, and they’ll have a concept for Michael Jordan. And these concepts will be using similar groups of neurons. So there’s a lot of overlap between the basketball concept and the Michael Jordan concept. You’re going to see similar overlap in most models.

23.03
So channeling your previous self, if you were to give a keynote at a conference and I gave you three slides—this is in front of developers, mind you, not ML researchers—what are the one to three things about interpretability research that developers should know about or potentially even implement or do something about today?

23.30
Oh man, it's a good question. My first slide would say something like: Models, language models in particular, are complicated and interesting, and they can be understood, and it's worth spending time to understand them. The point here being, we don't have to treat them as this mysterious thing. We don't have to settle for approximations like, "Oh, they're just next-token predictors, they're just pattern matchers, they're black boxes." We can look inside, and we can make progress on understanding them, and we can find a lot of rich structure. That would be slide one.

24.10
Slide two would be the stuff that we talked about at the start of this conversation, which would be, "Here's three ways your intuitions are wrong." You know, oftentimes this is, "Look at this example of a model planning many tokens ahead, not just predicting the next token. And look at this example of the model having these rich representations, showing that it's actually doing multistep reasoning in its weights rather than just matching to some training data example." And then I don't know what my third example would be. Maybe this universal language example we talked about. Complicated, interesting stuff.

24.44
And then, three: What can you do about it? That’s the third slide. It’s an early research area. There’s not anything that you can take that will make anything that you’re building better today. Hopefully if I’m viewing this presentation in six months or a year, maybe this third slide is different. But for now, that’s what it is.

25.01
If you're interested in this stuff, there are these open source libraries that let you do this tracing on open source models. Just go grab some small open source model, ask it some weird question, and then just look inside its brain and see what happens.

I think the thing that I respect the most and identify [with] the most about just being an engineer or developer is this willingness to understand, this stubbornness to understand: Your program has a bug; I'm going to figure out what it is, and it doesn't matter what level of abstraction it's at.

And I would encourage people to use that same level of curiosity and tenacity to look inside these very weird models that are everywhere now. Those would be my three slides.

25.49
Let me ask a follow up question. As you know, most teams are not going to be doing much pretraining. A lot of teams will do some form of posttraining, whatever that might be—fine-tuning, some form of reinforcement learning for the more advanced teams, a lot of prompt engineering, prompt optimization, prompt tuning, some sort of context grounding like RAG or GraphRAG.

You know more about how these models work than a lot of people. How would you approach these various things in a toolbox for a team? You’ve got prompt engineering, some fine-tuning, maybe distillation, I don’t know. So put on your posttraining hat, and based on what you know about interpretability or how these models work, how would you go about, systematically or in a principled way, approaching posttraining? 

26.54
Lucky for you, I also used to work on the posttraining team at Anthropic. So I have some experience as well. I think it’s funny, what I’m going to say is the same thing I would have said before I studied these model internals, but maybe I’ll say it in a different way or something. The key takeaway I keep on having from looking at model internals is, “God, there’s a lot of complexity.” And that means they’re able to do very complex reasoning just in latent space inside their weights. There’s a lot of processing that can happen—more than I think most people have an intuition for. And two, that also means that usually, they’re doing a bunch of different algorithms at once for everything they do.

So they're solving problems in three different ways. And a lot of times, the weird mistakes you might see when you're looking at your fine-tuning or just looking at the resulting model are, "Ah, well, there are three different ways to solve this thing, and the model just kind of picked the wrong one this time."

Because these models are already so complicated, I find that the first thing to do is just pretty much always to build some sort of eval suite. That’s the thing that people fail at the most. It doesn’t take that long—it usually takes an afternoon. You just write down 100 examples of what you want and what you don’t want. And then you can get incredibly far by just prompt engineering and context engineering, or just giving the model the right context.
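As a concrete starting point, a minimal eval harness can be as small as the sketch below. The file name, the `call_model` hook, and the pass/fail rules are all placeholders; the point is just that a hundred labeled examples plus a dumb checker gets you a scoreboard you can rerun after every prompt or context change.

```python
# Minimal eval-suite sketch. `call_model` is a placeholder for however you invoke
# your model (API call, local pipeline, etc.); evals.jsonl is an assumed file name.
import json

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model or API client here")

def passes(case: dict, answer: str) -> bool:
    # Keep checks dumb and transparent: substring and "must not contain" rules
    # catch a surprising amount, and you can add an LLM grader later.
    ok = all(s.lower() in answer.lower() for s in case.get("must_include", []))
    bad = any(s.lower() in answer.lower() for s in case.get("must_avoid", []))
    return ok and not bad

def run_evals(path="evals.jsonl"):
    results = []
    with open(path) as f:
        for line in f:
            case = json.loads(line)  # {"prompt": ..., "must_include": [...], "must_avoid": [...]}
            results.append(passes(case, call_model(case["prompt"])))
    print(f"{sum(results)}/{len(results)} cases passed")

# Each line of evals.jsonl is one example of what you want (or don't want), e.g.:
# {"prompt": "Summarize this ticket...", "must_include": ["refund"], "must_avoid": ["I'm just an AI"]}
```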

28.34
That's my experience, having worked on fine-tuning models: Fine-tuning is something you only want to resort to if everything else fails. I mean, it's pretty rare that everything else fails, especially with the models getting better. And so, yeah, understanding that, in principle, the models have an immense amount of capacity and it's just your job to tease that capacity out is the first thing I would say. Or the second thing, I guess, after "just build some evals."

29.00
And with that, thank you, Emmanuel. 

29.03
Thanks, man.

Generative AI in the Real World: Faye Zhang on Using AI to Improve Discovery

18 September 2025 at 06:12

In this episode, Ben Lorica and AI engineer Faye Zhang talk about discoverability: how to use AI to build search and recommendation engines that actually find what you want. Listen in to learn how AI goes way beyond simple collaborative filtering—pulling in many different kinds of data and metadata, including images and voice, to get a much better picture of what any object is and whether or not it’s something the user would want.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

0:00: Today we have Faye Zhang of Pinterest, where she’s a staff AI engineer. And so with that, very welcome to the podcast.

0:14: Thanks, Ben. Huge fan of the work. I’ve been fortunate to attend both the Ray and NLP Summits. I know where you serve as chairs. I also love the O’Reilly AI podcast. The recent episode on A2A and the one with Raiza Martin on NotebookLM have been really inspirational. So, great to be here. 

0:33: All right, so let’s jump right in. So one of the first things I really wanted to talk to you about is this work around PinLanding. And you’ve published papers, but I guess at a high level, Faye, maybe describe for our listeners: What problem is PinLanding trying to address?

0:53: Yeah, that’s a great question. I think, in short, trying to solve this trillion-dollar discovery crisis. We’re living through the greatest paradox of the digital economy. Essentially, there’s infinite inventory but very little discoverability. Picture one example: A bride-to-be asks ChatGPT, “Now, find me a wedding dress for an Italian summer vineyard ceremony,” and she gets great general advice. But meanwhile, somewhere in Nordstrom’s hundreds of catalogs, there sits the perfect terracotta Soul Committee dress, never to be found. And that’s a $1,000 sale that will never happen. And if you multiply this by a billion searches across Google, SearchGPT, and Perplexity, we’re talking about a $6.5 trillion market, according to Shopify’s projections, where every failed product discovery is money left on the table. So that’s what we’re trying to solve—essentially solve the semantic organization of all platforms versus user context or search. 

2:05: So, before PinLanding was developed, and if you look across the industry and other companies, what would be the default—what would be the incumbent system? And what would be insufficient about this incumbent system?

2:22: There have been researchers across the past decade working on this problem; we’re definitely not the first one. I think number one is to understand the catalog attribution. So, back in the day, there was multitask R-CNN generation, as we remember, [that could] identify fashion shopping attributes. So you would pass in-system an image. It would identify okay: This shirt is red and that material may be silk. And then, in recent years, because of the leverage of large scale VLM (vision language models), this problem has been much easier. 

3:03: And then I think the second route that people come in is via the content organization itself. Back in the day, [there was] research on join graph modeling on shared similarity of attributes. And a lot of ecommerce stores also do, “Hey, if people like this, you might also like that,” and that relationship graph gets captured in their organization tree as well. We utilize a vision large language model and then the foundation model CLIP by OpenAI to easily recognize what this content or piece of clothing could be for. And then we connect that between LLMs to discover all possibilities—like scenarios, use case, price point—to connect two worlds together. 

3:55: To me that implies you have some rigorous eval process or even a separate team doing eval. Can you describe to us at a high level what is eval like for a system like this? 

4:11: Definitely. I think there are internal and external benchmarks. For the external ones, it's the Fashion200K, which is a public benchmark anyone can download from Hugging Face, a standard for how accurate your model is at predicting fashion items. So we measure the performance using recall top-k metrics, which measure whether the correct label appears among the top-k predicted attributes, and as a result, we were able to see 99.7% recall for the top ten.
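For readers who haven't computed it before, recall@k here is just the fraction of items whose true label shows up in the model's top-k predictions. The exact Fashion200K protocol may differ; this is only the metric itself.

```python
# Minimal recall@k sketch; predictions are ranked label lists, one per item.
def recall_at_k(predictions, ground_truth, k=10):
    hits = sum(1 for preds, truth in zip(predictions, ground_truth) if truth in preds[:k])
    return hits / len(ground_truth)

preds = [["red dress", "maxi dress", "silk gown"], ["denim jacket", "blazer", "parka"]]
truth = ["maxi dress", "trench coat"]
print(recall_at_k(preds, truth, k=3))  # 0.5: only the first item's label is in the top 3
```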

4:47: The other topic I wanted to talk to you about is recommendation systems. So obviously there’s now talk about, “Hey, maybe we can go beyond correlation and go towards reasoning.” Can you [tell] our audience, who may not be steeped in state-of-the-art recommendation systems, how you would describe the state of recommenders these days?

5:23: For the past decade, [we’ve been] seeing tremendous movement from foundational shifts on how RecSys essentially operates. Just to call out a few big themes I’m seeing across the board: Number one, it’s kind of moving from correlation to causation. Back then it was, hey, a user who likes X might also like Y. But now we actually understand why contents are connected semantically. And our LLM AI models are able to reason about the user preferences and what they actually are. 

5:58: The second big theme is probably the cold start problem, where companies leverage semantic IDs to solve the new item by encoding content, understanding the content directly. For example, if this is a dress, then you understand its color, style, theme, etc. 

6:17: And I think there are other bigger themes we're seeing; for example, Netflix is moving from isolated systems to a unified intelligence. Just this past year, Netflix [updated] their multitask architecture with shared representations, into one they called the UniCoRn system, to enable company-wide improvements [and] optimizations.

6:44: And very lastly, I think on the frontier side—this is actually what I learned at the AI Engineer Summit from YouTube. It’s a DeepMind collaboration, where YouTube is now using a large recommendation model, essentially teaching Gemini to speak the language of YouTube: of, hey, a user watched this video, then what might [they] watch next? So a lot of very exciting capabilities happening across the board for sure. 

7:15: Generally it sounds like the themes from years past still map over in the following sense, right? So there’s content—the difference being now you have these foundation models that can understand the content that you have more granularly. It can go deep into the videos and understand, hey, this video is similar to this video. And then the other source of signal is behavior. So those are still the two main buckets?

7:53: Correct. Yes, I would say so. 

7:55: And so the foundation models help you on the content side but not necessarily on the behavior side?

8:03: I think it depends on how you want to see it. For example, on the embedding side, which is a kind of representation of a user entity, there have been transformations [since] back in the day with the BERT Transformer. Now it's got long-context encapsulation. And those are all with the help of LLMs. And so we can better understand users, not just from the next or the last clicks, but "hey, [in the] next 30 days, what might a user like?"

8:31: I’m not sure this is happening, so correct me if I’m wrong. The other thing that I would imagine that the foundation models can help with is, I think for some of these systems—like YouTube, for example, or maybe Netflix is a better example—thumbnails are important, right? The fact now that you have these models that can generate multiple variants of a thumbnail on the fly means you can run more experiments to figure out user preferences and user tastes, correct? 

9:05: Yes. I would say so. I was lucky enough to be invited to one of the engineer network dinners, [and was] speaking with the engineer who actually works on the thumbnails. Apparently it was all personalized, and the approach you mentioned enabled their rapid iteration of experiments, and had definitely yielded very positive results for them. 

9:29: For the listeners who don’t work on recommendation systems, what are some general lessons from recommendation systems that generally map to other forms of ML and AI applications? 

9:44: Yeah, that’s a great question. A lot of the concepts still apply. For example, the knowledge distillation. I know Indeed was trying to tackle this. 

9:56: Maybe Faye, first define what you mean by that, in case listeners don’t know what that is. 

10:02: Yes. So knowledge distillation is essentially, from a model sense, learning from a parent model with larger, bigger parameters that has better world knowledge (and the same with ML systems)—to distill into smaller models that can operate much faster but still hopefully encapsulate the learning from the parent model. 

10:24: So I think what Indeed faced back then was the classic precision-versus-recall trade-off in production ML. Their binary classifier needs to really filter the batch of jobs that you would recommend to the candidates. But this process is obviously very noisy, and sparse training data can cause latency and also constraints. So I think in the work they published back then, they couldn't really get Mistral and maybe Llama 2 to effectively separate résumé content. And then they were happy to learn [that] out-of-the-box GPT-4 achieved something like 90% precision and recall. But obviously GPT-4 is more expensive and has close to 30 seconds of inference time, which is much slower.

11:21: So I think what they did is use the distillation concept to fine-tune GPT-3.5 on labeled data, and then distill it into a lightweight BERT-based model using a temperature-scaled softmax, and they were able to achieve millisecond latency and a comparable recall-precision trade-off. So I think that's one of the learnings we see across the industry: The traditional ML techniques still work in the age of AI. And I think we're going to see a lot more of that in production work as well.
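The temperature-scaled softmax trick mentioned here has a compact, standard form. The sketch below is a generic PyTorch version; the class count, temperature, and mixing weight are illustrative, and the teacher and student are whatever your pipeline provides (for instance, a fine-tuned grader producing soft labels and a small BERT classifier), represented here only by their logits.

```python
# Generic knowledge-distillation loss: soft targets from the teacher at temperature T,
# mixed with ordinary cross-entropy on the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard scaling so gradients stay comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 2)  # e.g. relevant / not relevant logits for 4 examples
teacher = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
print(distillation_loss(student, teacher, labels))
```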

11:57: By the way, one of the underappreciated things in the recommendation system space is actually UX in some ways, right? Because basically good UX for delivering the recommendations actually can move the needle. How you actually present your recommendations might make a material difference.  

12:24: I think that’s very much true. Although I can’t claim to be an expert on it because I know most recommendation systems deal with monetization, so it’s tricky to put, “Hey, what my user clicks on, like engage, send via social, versus what percentage of that…

12:42: And it’s also very platform specific. So you can imagine TikTok as one single feed—the recommendation is just on the feed. But YouTube is, you know, the stuff on the side or whatever. And then Amazon is something else. Spotify and Apple [too]. Apple Podcast is something else. But in each case, I think those of us on the outside underappreciate how much these companies invest in the actual interface.

13:18: Yes. And I think there are multiple iterations happening on any day, [so] you might see a different interface than your friends or family because you’re actually being grouped into A/B tests. I think this is very much true of [how] the engagement and performance of the UX have an impact on a lot of the search/rec system as well, beyond the data we just talked about. 

13:41: Which brings to mind another topic that is also something I’ve been interested in, over many, many years, which is this notion of experimentation. Many of the most successful companies in the space actually have invested in experimentation tools and experimentation platforms, where people can run experiments at scale. And those experiments can be done much more easily and can be monitored in a much more principled way so that any kind of things they do are backed by data. So I think that companies underappreciate the importance of investing in such a platform. 

14:28: I think that’s very much true. A lot of larger companies actually build their own in-house A/B testing experiment or testing frameworks. Meta does; Google has their own and even within different cohorts of products, if you’re monetization, social. . . They have their own niche experimentation platform. So I think that thesis is very much true. 

14:51: The last topic I wanted to talk to you about is context engineering. I’ve talked to numerous people about this. So every six months, the context window for these large language models expands. But obviously you can’t just stuff the context window full, because one, it’s inefficient. And two, actually, the LLM can still make mistakes because it’s not going to efficiently process that entire context window anyway. So talk to our listeners about this emerging area called context engineering. And how is that playing out in your own work? 

15:38: I think this is a fascinating topic, where you will hear people passionately say, “RAG is dead.” And it’s really, as you mentioned, [that] our context window gets much, much bigger. Like, for example, back in April, Llama 4 had this staggering 10 million token context window. So the logic behind this argument is quite simple. Like if the model can indeed handle millions of tokens, why not just dump everything instead of doing a retrieval?

16:08: I think there are quite a few fundamental limitations to this. I know folks from Contextual AI are passionate about this. I think number one is scalability. A lot of times in production, at least, your knowledge base is measured in terabytes or petabytes. So not tokens; something even larger. And number two I think would be accuracy.

16:33: The effective context window is very different, honestly, from what is advertised in product launches. We see performance degrade long before the model reaches its "official limits." And then I think number three is probably efficiency, and that kind of aligns, honestly, with our human behavior as well. Like, do you read an entire book every time you need to answer one simple question? So I think context engineering [has] slowly evolved from a buzzword a few years ago to an engineering discipline now.

17:15: I'm appreciative that the context windows are increasing. But at some level, I also acknowledge that to some extent it's a feel-good move on the part of the model builders. It makes us feel good that we can put more things in there, but it may not actually help us answer the question precisely. Actually, a few years ago, I wrote kind of a tongue-in-cheek post called "Structure Is All You Need." So basically whatever structure you have, you should use it to help the model, right? If it's in a SQL database, then maybe you can expose the structure of the data. If it's a knowledge graph, you leverage whatever structure you have to provide the model better context. So this whole critique of just stuffing the model with as much information as possible is valid, for all the reasons you gave. But also, philosophically, it doesn't make any sense to do that anyway.

18:30: What are the things that you are looking forward to, Faye, in terms of foundation models? What kinds of developments in the foundation model space are you hoping for? And are there any developments that you think are below the radar? 

18:52: I think, to better utilize the concept of "contextual engineering," there are essentially two loops. Number one, there's the inner loop: what happens within the LLM. And then there's the outer loop: what can you do as an engineer to optimize a given context window, etc., to get the best results out of the product. Within the context loop, there are multiple tricks we can do: For example, there's vector plus lexical or regex extraction. There's metadata filters. And then for the outer loop—this is a very common practice—people are using LLMs as a reranker, sometimes a cross-encoder. So the thesis is, hey, why would you overburden an LLM with ranking 20,000 items when there are things you can do to reduce that to the top hundred or so? So all of this—context assembly, deduplication, and diversification—would help our production [go] from a prototype to something [that's] more real time, reliable, and able to scale.
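A minimal sketch of that outer loop might look like the following. The `score` callable stands in for whatever reranker you actually use (a cross-encoder, an LLM judge, a BM25 hybrid), and the dedup rule, `keep`, and `max_chars` budgets are arbitrary placeholders.

```python
# Outer-loop context assembly sketch: retrieve wide, deduplicate, rerank, and only
# then hand a small, diverse context to the LLM instead of stuffing the window.
from typing import Callable

def assemble_context(query: str,
                     candidates: list[str],
                     score: Callable[[str, str], float],
                     keep: int = 20,
                     max_chars: int = 8000) -> str:
    # 1. Cheap dedup: drop near-verbatim repeats.
    seen, unique = set(), []
    for doc in candidates:
        key = doc.strip().lower()[:200]
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    # 2. Rerank with a stronger scorer and keep only the top few.
    ranked = sorted(unique, key=lambda d: score(query, d), reverse=True)[:keep]
    # 3. Pack into a character budget rather than the whole corpus.
    context, used = [], 0
    for doc in ranked:
        if used + len(doc) > max_chars:
            break
        context.append(doc)
        used += len(doc)
    return "\n\n---\n\n".join(context)
```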

20:07: One of the things I wish—and I don't know, this is wishful thinking—is maybe if the models can be a little more predictable, that would be nice. By that, I mean, if I ask a question in two different ways, it'll basically give me the same answer. The foundation model builders can somehow increase predictability and maybe provide us with a little more explanation for how they arrive at the answer. I understand they're giving us the tokens, and maybe some of the reasoning models are a little more transparent, but give us an idea of how these things work, because it'll impact what kinds of applications we'd be comfortable deploying these things in. For example, for agents. If I'm using an agent to use a bunch of tools, but I can't really predict their behavior, that impacts the types of applications I'd be comfortable using a model for.

21:18: Yeah, definitely. I very much resonate with this, especially now most engineers have, you know, AI empowered coding tools like Cursor and Windsurf—and as an individual, I very much appreciate the train of thought you mentioned: why an agent does certain things. Why is it navigating between repositories? What are you looking at while you’re doing this call? I think these are very much appreciated. I know there are other approaches—look at Devin, that’s the fully autonomous engineer peer. It just takes things, and you don’t know where it goes. But I think in the near future there will be a nice marriage between the two. Well, now since Windsurf is part of Devin’s parent company. 

22:05: And with that, thank you, Faye.

22:08: Awesome. Thank you, Ben.

Generative AI in the Real World: Luke Wroblewski on When Databases Talk Agent-Speak

4 September 2025 at 12:01

Join Luke Wroblewski and Ben Lorica as they talk about the future of software development. What happens when we have databases that are designed to interact with agents and language models rather than humans? We’re starting to see what that world will look like. It’s an exciting time to be a software developer.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Timestamps

  • 0:00: Introduction to Luke Wroblewski of Sutter Hill Ventures. 
  • 0:36: You’ve talked about a paradigm shift in how we write applications. You’ve said that all we need is a URL and model, and that’s an app. Has anyone else made a similar observation? Have you noticed substantial apps that look like this?
  • 1:08: The future is here; it’s just not evenly distributed yet. That’s what everyone loves to say. The first websites looked nothing like robust web applications, and now we have a multimedia podcast studio running in the browser. We’re at the phase where some of these things look and feel less robust. And our ideas for what constitutes an application change in each of these phases. If I told you pre-Google Maps that we’d be running all of our web applications in a browser, you’d have laughed at me. 
  • 2:13: I think what you mean is an MCP server, and the model itself is the application, correct?
  • 2:24: Yes. The current definition of an application, in a simple form, is running code and a database. We’re at the stage where you have AI coding agents that can handle the coding part. But we haven’t really had databases that have been designed for the way those agents think about code and interacting with data.
  • 2:57: Now that we have databases that work the way agents work, you can take out the running-code part almost. People go to Lovable or Cursor and they’re forced to look at code syntax. But if an AI model can just use a database effectively, it takes the role of the running code. And if it can manage data visualizations and UI, you don’t need to touch the code. You just need to point the AI at a data structure it can use effectively. MCP UI is a nice example of people pushing in this direction.
  • 4:12: Which brings us to something you announced recently: AgentDB. You can find it at agentdb.dev. What problem is AgentDB trying to solve?
  • 4:34: Related to what we were just talking about: How do we get AI agents to use databases effectively? Most things in the technology stack are made for humans and the scale at which humans operate.
  • 5:06: They’re still designed for a DBA, but eliminating the command line, right? So you still have to have an understanding of DBA principles?
  • 5:19: How do you pick between the different compute options? How do you pick a region? What are the security options? And it’s not something you’re going to do thousands of times a day. Databricks just shared some stats where they said that thousands of databases per agent get made a day. They think 99% of databases being made are going to be made by agents. What is making all these databases? No longer humans. And the scale at which they make them—thousands is a lowball number. It will be way, way higher than that. How do we make a database system that works in that reality?
  • 6:22: So the high-level thesis here is that lots of people will be creating agents, and these agents will rely on something that looks like a database, and many of these people won’t be hardcore engineers. What else?
  • 6:45: It’s also agents creating agents, and agents creating applications, and agents deciding they need a database to complete a task. The explosion of these smart machine uses and workflows is well underway. But we don’t have an infrastructure that was made for that world. They were all designed to work with humans.
  • 7:31: So in the classic database world, you’d consider AgentDB more like OLTP rather than analytics and OLAP.
  • 7:42: Yeah, for analytics you'd probably stick your logs somewhere else. The characteristics that make AgentDB really interesting for agents is, number 1: To create a database, all you really need is a unique ID. The creation of the ID manifests a database out of thin air. And we store it as a file, so you can scale like crazy. And all of these databases are fully isolated. They're also downloadable, deletable, releasable—all the characteristics of a filesystem. We also have the concept of a template that comes along with the database. That gives the AI model or agent all the context it needs to start using the database immediately. If you just point Claude at a database, it will need to look at the structure (schema). It will burn tokens and time trying to get the structure of the information. And every time it does this is an opportunity to make a mistake. With AgentDB, when an agent or an AI model is pointed at the database with a template, it can immediately write a query because we have in there a description of the database, the schema. So you save time, cut down errors, and don't have to go through that learning step every time the model touches a database. (See the sketch after this list for a generic illustration of the template idea.)
  • 10:22: I assume this database will have some of the features you like, like ACID, vector search. So what kinds of applications have people built using AgentDB? 
  • 10:53: We put up a little demo page where we allow you to start the process with a CSV file. You upload it, and it will create the database and give you an MCP URL. So people are doing things like personal finance. People are uploading their credit card statements, their bank statements, because those applications are horrendous.
  • 11:39: So it’s the actual statement; it parses it?
  • 11:45: Another example: Someone has a spreadsheet to track jobs. They can take that, upload it, it gives them a template and a database and an MCP URL. They can pop that job-tracking database into Claude and do all the things you can do with a chat app, like ask, “What did I look at most recently?”
  • 12:35: Do you envision it more like a DuckDB, more embedded, not really intended for really heavy transactional, high-throughput, more-than-one-table complicated schemas?
  • 12:49: We currently support DuckDB and SQLite. But there are a bunch of folks who have made multiple table apps and databases.
  • 13:09: So it’s not meant for you to build your own CRM?
  • 13:18: Actually, one of our go-to-market guys had data of people visiting the website. He can dump that as a spreadsheet. He has data of people starring repos on GitHub. He has data of people who reached out through this form. He has all of these inbound signals of customers. So he took those, dropped them in as CSV files, put it in Claude, and then he can say, “Look at these, search the web for information about these, add it to the database, sort it by priority, assign it to different reps.” It’s CRM-ish already, but super-customized to his particular use case. 
  • 14:27: So you can create basically an agentic Airtable.
  • 14:38: This means if you’re building AI applications or databases—traditionally that has been somewhat painful. This removes all that friction.
  • 15:00: Yes, and it leads to a different way of making apps. You take that CSV file, you take that MCP URL, and you have a chat app.
  • 15:17: Even though it’s accessible to regular users, it’s something developers should consider, right?
  • 15:25: We’re starting to see emergent end-user use cases, but what we put out there is for developers. 
  • 15:38: One of the other things you’ve talked about is the notion that software development has flipped. Can you explain that to our listeners?
  • 15:56: I spent eight and a half years at Google, four and a half at Yahoo, two and a half at eBay, and the traditional process of deciding what we're going to do next happens up front: There's a lot of drawing pictures and stuff. We had to scope engineering time. A lot of the stuff was front-loaded to figure out what we were going to build. Now with things like AI agents, you can build it and then start thinking about how it integrates inside the project. At a lot of our companies that are working with AI coding agents, I think this naturally starts to happen: There's a manifestation of the technology that helps you think through what the design should be, how do we integrate it into the product, should we launch this? This is what I mean by "flipped."
  • 17:41: If I’m in a company like a big bank, does this mean that engineers are running ahead?
  • 17:55: I don’t know if it’s happening in big banks yet, but it’s definitely happening in startup companies. And design teams have to think through “Here’s a bunch of stuff, let me do a wash across all that to fit in,” as opposed to spending time designing it earlier. There are pros and cons to both of these. The engineers were cleaning up the details in the previous world. Now the opposite is true: I’ve built it, now I need to design it.
  • 18:55: Does this imply a new role? There’s a new skill set that designers have to develop?
  • 19:07: There’s been this debate about “Should designers code?” Over the years lots of things have reduced the barrier to entry, and now we have an even more dramatic reduction. I’ve always been of the mindset that if you understand the medium, you will make better things. Now there’s even less of a reason not to do it.
  • 19:50: Anecdotally, what I’m observing is that the people who come from product are able to build something, but I haven’t heard as many engineers thinking about design. What are the AI tools for doing that?
  • 20:19: I hear the same thing. What I hope remains uncommoditized is taste. I’ve found that it’s very hard to teach taste to people. If I have a designer who is a good systems thinker but doesn’t have the gestalt of the visual design layer, I haven’t been able to teach that to them. But I have been able to find people with a clear sense of taste from diverse design backgrounds and get them on board with interaction design and systems thinking and applications.
  • 21:02: If you’re a young person and you’re skilled, you can go into either design or software engineering. Of course, now you’re reading articles saying “forget about software engineering.” I haven’t seen articles saying “forget about design.”
  • 21:31: I disagree with the idea that it’s a bad time to be an engineer. It’s never been more exciting.
  • 21:46: But you have to be open to that. If you’re a curmudgeon, you’re going to be in trouble.
  • 21:53: This happens with every technical platform transition. I spent so many years during the smartphone boom hearing people say, “No one is ever going to watch TV and movies on mobile.” Is it an affinity to the past, or a sense of doubt about the future? Every time, it’s been the same thing.
  • 22:37: One way to think of AgentDB is like a wedge. It addresses one clear pain point in the stack that people have to grapple with. So what’s next? Is it Kubernetes?
  • 23:09: I don’t want to go near that one! The broader context of how applications are changing—how do I create a coherent product that people understand how to use, that has aesthetics, that has a personality?—is a very wide-open question. There’s a bunch of other systems that have not been made for AI models. A simple example is search APIs. Search APIs are basically structured the same way as results pages. Here’s your 10 blue links. But an agentic model can suck up so much information. Not only should you be giving it the web page, you should be giving it the whole site. Those systems are not built for this world at all. You can go down the list of the things we use as core infrastructure and think about how they were made for a human, not the capabilities of an enormous large language model.
  • 24:39: Right now, I’m writing an article on enterprise search, and one of things people don’t realize is that it’s broken. In terms of AgentDB, do you worry about things like security, governance? There’s another place black hat attackers can go after.
  • 25:20: Absolutely. All new technologies have the light side and the dark side. It's always been a codebreaker-codemaker game. That doesn't change. The attack vectors are different and, in the early stages, we don't know what they are, so it is a cat and mouse game. There was an era when spam in email was terrible; your mailbox would be full of spam and you manually had to mark things as junk. Now you use Gmail, and you don't think about it. When was the last time you went into the junk mail tab? We built systems, we got smarter, and the average person doesn't think about it.
  • 26:31: As you have more people building agents, and agents building agents, you have data governance, access control; suddenly you have AgentDB artifacts all over the place. 
  • 27:06: Two things here. This is an underappreciated part of this. Two years ago I launched my own personal chatbot that works off my writings. People ask me what model am I using, and how is it built? Those are partly interesting questions. But the real work in that system is constantly looking at the questions people are asking, and evaluating whether or not it responded well. I’m constantly course-correcting the system. That’s the work that a lot of people don’t do. But the thing I’m doing is applying taste, applying a perspective, defining what “good” is. For a lot of systems like enterprise search, it’s like, “We deployed the technology.” How do you know if it’s good or not? Is someone in there constantly tweaking and tuning? What makes Google Search so good? It’s constantly being re-evaluated. Or Google Translate—was this translation good or bad? Baked in early on.
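To make the template idea from the AgentDB discussion concrete, without implying anything about AgentDB's actual API, here is a small sqlite3 sketch: the database ships with a machine-readable description of its schema and an example query, which is what you would hand an agent (for example, in an MCP tool description) so it doesn't have to introspect the schema on every call. All names here (jobs.db, the applications table) are made up for illustration.

```python
# Illustrative "database plus template" sketch using sqlite3; not AgentDB's API.
import json
import sqlite3

TEMPLATE = {
    "description": "Job-application tracker",
    "tables": {
        "applications": {
            "columns": {"company": "TEXT", "role": "TEXT",
                        "applied_on": "TEXT (ISO date)", "status": "TEXT"},
            "example_query": "SELECT company, status FROM applications ORDER BY applied_on DESC",
        }
    },
}

def create_db(path="jobs.db"):
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS applications
                    (company TEXT, role TEXT, applied_on TEXT, status TEXT)""")
    conn.commit()
    return conn

def agent_context() -> str:
    # What you'd hand the model up front, instead of making it burn tokens
    # rediscovering the schema on every call.
    return "You can query this SQLite database:\n" + json.dumps(TEMPLATE, indent=2)

conn = create_db()
conn.execute("INSERT INTO applications VALUES ('Acme', 'Data Engineer', '2025-08-01', 'interview')")
conn.commit()
print(agent_context())
print(conn.execute(TEMPLATE["tables"]["applications"]["example_query"]).fetchall())
```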
