
The Human Behind the Door

23 January 2026 at 07:14
The following article originally appeared on Mike Amundsen’s Substack Signals from Our Futures Past and is being republished here with the author’s permission.

There’s an old hotel on a windy corner in Chicago where the front doors shine like brass mirrors. Each morning, before guests even reach the step, a tall man in a gray coat swings one open with quiet precision. He greets them by name, gestures toward the elevator, and somehow makes every traveler feel like a regular. To a cost consultant, he is a line item. To the guests, he is part of the building’s atmosphere.

When management installed automatic doors a few years ago, the entrance became quieter and cheaper, but not better. Guests no longer lingered to chat, taxis stopped less often, and the lobby felt colder. The automation improved the hotel’s bottom line but not its character.

This story captures what British advertising executive Rory Sutherland calls “The Doorman Fallacy,” the habit of mistaking visible tasks for the entirety of a role. In this short video explanation, Sutherland points out that a doorman does more than open doors. He represents safety, care, and ceremony. His presence changes how people feel about a place. Remove him, and you save money but lose meaning.

The Lesson Behind the Metaphor

Sutherland expanded on the idea in his 2019 book Alchemy, arguing that logic alone can lead organizations astray. We typically undervalue the intangible parts of human work because they do not fit neatly into a spreadsheet. For example, the doorman seems redundant only if you assume his job is merely mechanical. In truth, he performs a social and symbolic function. He welcomes guests, conveys prestige, and creates a sense of safety.

Of course, this lesson extends well beyond hotels. In business after business, human behavior is treated as inefficiency. The result is thinner experiences, shallower relationships, and systems that look streamlined on paper but feel hollow in practice.

The Doorman in the Age of AI

In a recent article for The Conversation, Gediminas Lipnickas of the University of South Australia argues that many companies are repeating the same mistake with artificial intelligence. He warns against the tendency to replace people because technology can imitate their simplest tasks while ignoring the judgment, empathy, and adaptability that define the job.

Lipnickas offers two examples.

The Commonwealth Bank of Australia laid off 45 customer service agents after rolling out a voice bot, then reversed the decision when it realized the employees were not redundant. They were context interpreters, not just phone operators.

Taco Bell introduced AI voice ordering at drive-throughs to speed up service, but customers complained of errors, confusion, and surreal exchanges with synthetic voices. The company paused the rollout and conceded that human improvisation worked better, especially during busy periods.

Both cases reveal the same pattern: Automation succeeds technically but fails experientially. It is the digital version of installing an automatic door and wondering why the lobby feels empty.

Measuring the Wrong Thing

The doorman fallacy persists because organizations keep measuring only what is visible. Performance dashboards reward tidy numbers (calls answered, tickets closed, customer contacts avoided) because they are easy to track. But they miss the essence of the work: problem-solving, reassurance, and quiet support.

When we optimize for visible throughput instead of invisible value, we teach everyone to chase efficiency at the expense of meaning. A skilled agent does not just resolve a complaint; they interpret tone and calm frustration. A nurse does not merely record vitals; they notice hesitation that no sensor can catch. A line cook does not just fill orders; they maintain the rhythm of a kitchen.

The answer is not to stop measuring; it is to do a better job of measuring. Key results should focus on interaction, problem-solving, and support, not just volume and speed. Otherwise, we risk automating away the very parts of work that make it valuable.

Efficiency Versus Empathy

Sutherland’s insight and Lipnickas’s warning meet at the same point: When efficiency ignores empathy, systems break down. Automation works well for bounded, rule-based tasks such as data entry, image processing, or predictive maintenance. But as soon as creativity, empathy, and nuanced problem-solving enter the picture, humans remain indispensable.

What looks like inefficiency on paper is often resilience in practice. A doorman who pauses to chat with a regular guest may appear unproductive, yet that moment strengthens loyalty and reputation in ways no metric can show.

Coaching, Not Replacing

That is why my own work has focused on using AI as a coach or mentor, not as a worker. A well-designed AI coach can prompt reflection, offer structure, and accelerate learning, but it still relies on human curiosity to drive the process. The machine can surface possibilities, but only the person can decide what matters.

When I design an AI coach, I think of it as a partner in thought, closer to Douglas Engelbart’s idea of human-computer partnership than to a substitute employee. The coach asks questions, provides scaffolding, and amplifies creativity. It does not replace the messy, interpretive work that defines human intelligence.

A More Human Kind of Intelligence

The deeper lesson of the doorman fallacy is that intelligence is not a property of isolated systems but of relationships. The doorman’s value emerges in the interplay between person and place, gesture and response. The same is true for AI. Detached from human context, it becomes thin and mechanical. Driven by human purpose, it becomes powerful and humane.

Every generation of innovation faces this tension. The industrial revolution promised to free us from labor but often stripped away craftsmanship. The digital revolution promises connection but frequently delivers distraction. Now the AI revolution promises efficiency, but unless we are careful, it may erode the very qualities that make work worth doing.

As we rush to install the next generation of technological “automatic doors,” let us remember the person who once stood beside them. Not out of nostalgia but because the future belongs to those who still know how to welcome others in.

You can find out just how Mike uses AI as an assistant by joining him on February 11 on the O’Reilly learning platform for his live course AI-Driven API Design. He’ll take you through integrating AI-assisted automation into human-driven API design and leveraging AI tools like ChatGPT to optimize the design, documentation, and testing of web APIs. It’s free for O’Reilly members; register here.

Not a member?
Sign up for a 10-day free trial before the event to attend—and explore all the other resources on O’Reilly.

AI in the Office

22 January 2026 at 07:12

My father spent his career as an accountant for a major public utility. He didn’t talk about work much; when he engaged in shop talk, it was generally with other public utility accountants, and incomprehensible to those who weren’t. But I remember one story from work, and that story is relevant to our current engagement with AI.

He told me one evening about a problem at work. This was the late 1960s or early 1970s, and computers were relatively new. The operations division (the one that sends out trucks to fix things on poles) had acquired a number of “computerized” systems for analyzing engines—no doubt an early version of what your auto repair shop uses all the time. (And no doubt much larger and more expensive.) There was a question of how to account for these machines: Are they computing equipment? Or are they truck maintenance equipment? And it had turned into a kind of turf war between the operations people and the people we’d now call IT. (My father’s job was less about adding up long columns of figures than about making rulings on accounting policy issues like this; I used to call it “philosophy of accounting,” with my tongue not entirely in my cheek.)

My immediate thought was that this was a simple problem. The operations people probably want this to be considered computer equipment to keep it off their budget; nobody wants to overspend their budget. And the computing people probably don’t want all this extra equipment dumped onto their budget. It turned out that was exactly wrong. Politics is all about control, and the computer group wanted control of these strange machines with new capabilities. Did operations know how to maintain them? In the late ’60s, it’s likely that these machines were relatively fragile and contained components like vacuum tubes. Likewise, the operations group really didn’t want the computer group controlling how many of these machines they could buy and where to place them; the computer people would probably find something more fun to do with their money, like leasing a bigger mainframe, and leaving operations without the new technology. In the 1970s, computers were for getting the bills out, not mobilizing trucks to fix downed lines.

I don’t know how my father’s problem was resolved, but I do know how that relates to AI. We’ve all seen that AI is good at a lot of things—writing software, writing poems, doing research—we all know the stories. Human language may yet become a very-high-level, the highest-possible-level, programming language—the abstraction to end all abstractions. It may allow us to reach the holy grail: telling computers what we want them to do, not how (step-by-step) to do it. But there’s another part of enterprise programming, and that’s deciding what we want computers to do. That involves taking into account business practices, which are rarely as uniform as we’d like to think; hundreds of cross-cutting and possibly contradictory regulations; company culture; and even office politics. The best software in the world won’t be used, or will be used badly, if it doesn’t fit into its environment.

Politics? Yes, and that’s where my father’s story is important. The conflict between operations and computing was politics: power and control in the context of the dizzying regulations and standards that govern accounting at a public utility. One group stood to gain control; the other stood to lose it; and the regulators were standing by to make sure everything was done properly. It’s naive of software developers to think that’s somehow changed in the past 50 or 60 years, that somehow there’s a “right” solution that doesn’t take into account politics, cultural factors, regulation, and more.

Let’s look (briefly) at another situation. When I learned about domain-driven design (DDD), I was shocked to hear that a company could easily have a dozen or more different definitions of a “sale.” Sale? That’s simple. But to an accountant, it means entries in a ledger; to the warehouse, it means moving items from stock onto a truck, arranging for delivery, and recording the change in stocking levels; to sales, a “sale” means a certain kind of event that might even be hypothetical: something with a 75% chance of happening. Is it the programmer’s job to rationalize this, to say “let’s be adults, ‘sale’ can only mean one thing”? No, it isn’t. It is a software architect’s job to understand all the facets of a “sale” and find the best way (or, in Neal Ford and Mark Richards’s words, the “least worst way”) to satisfy the customer. Who is using the software, how are they using it, and how are they expecting it to behave?
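To make that concrete, here is a minimal sketch (hypothetical names, not drawn from any real system) of how three bounded contexts might each model a “sale” differently, with the architect’s job being the mapping between them rather than forcing a single definition:

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

# Accounting context: a "sale" is an entry in a ledger.
@dataclass
class LedgerSale:
    invoice_id: str
    amount: Decimal
    posted_on: date
    account_code: str

# Warehouse context: a "sale" is stock leaving the building.
@dataclass
class FulfillmentSale:
    order_id: str
    sku: str
    quantity: int
    ship_date: date

# Sales context: a "sale" may still be hypothetical.
@dataclass
class PipelineSale:
    opportunity_id: str
    expected_value: Decimal
    close_probability: float  # e.g., 0.75 for a 75% chance

# The architect's work happens at the boundaries: translating between
# these models, not collapsing them into one "correct" Sale.
```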

Powerful as AI is, thought like this is beyond its capabilities. It might be possible with more “embodied” AI: AI that was capable of sensing and tracking its surroundings, AI that was capable of interviewing people, deciding who to interview, parsing the office politics and culture, and managing the conflicts and ambiguities. It’s clear that, at the level of code generation, AI is much more capable of dealing with ambiguity and incomplete instructions than earlier tools. You can tell Claude “Just write me a simple parser for this document type, I don’t care how you do it.” But it’s not yet capable of working with the ambiguity that’s part of any human office. It isn’t capable of making a reasoned decision about whether these new devices are computers or truck maintenance equipment.

How long will it be before AI can make decisions like those? How long before it can reason about fundamentally ambiguous situations and come up with the “least worst” solution? We will see.

Building AI-Powered SaaS Businesses

21 January 2026 at 07:16

In preparation for our upcoming Building SaaS Businesses with AI Superstream, I sat down with event chair Jason Gilmore to discuss the full lifecycle of an AI-powered SaaS product, from initial ideation all the way to a successful launch.

Jason Gilmore is CTO of Adalo, a popular no-code mobile app builder. A technologist and software product leader with over 25 years of industry experience, Jason has spent 13 years building SaaS products at companies including Gatherit.co and the highly successful Nomorobo, and as CEO of the coding education platform Treehouse. He’s also a veteran of Xenon Partners, where he leads technical M&A due diligence and advises its portfolio of SaaS companies on AI adoption, and he previously served as CTO of DreamFactory.

Here’s our interview, edited for clarity and length.

Ideation

Michelle Smith: As a SaaS developer, what are the first steps you take when beginning the ideation process for a new product?

Jason Gilmore: I always start by finding a name that I love, buying the domain, and then creating a logo. Once I’ve done this, I feel like the idea is becoming real. This used to be a torturous process, but thanks to AI, my process is now quite smooth. I generate product names by asking ChatGPT for 10 candidates, refining them until I have three preferred options, and then checking availability via Lean Domain Search. I usually use ChatGPT to help with logos, but interestingly, while I was using Cursor, the popular AI-powered coding editor, it automatically created a logo for ContributorIQ as it set up the landing page. I hadn’t even asked for one, but it looked great, so I went with it!
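As a rough illustration of that naming loop, here’s a minimal sketch using the OpenAI Python SDK; the model name, prompt, and shortlist step are placeholders rather than Jason’s actual setup:

```python
# Sketch: ask an LLM for product name candidates, then shortlist by hand.
# Assumes the OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{
        "role": "user",
        "content": "Suggest 10 short, brandable names for a SaaS tool "
                   "that monitors website security. One name per line.",
    }],
)

candidates = [line.strip() for line in
              response.choices[0].message.content.splitlines() if line.strip()]
shortlist = candidates[:3]  # refine by hand, then check domain availability
print(shortlist)
```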

Once I nail down a name and logo, I’ll return to ChatGPT yet again and use it like a rubber duck. Of course, I’m not doing any coding or debugging at this point; instead, I’m just using ChatGPT as a sounding board, asking it to expand upon my idea, poke holes in it, and so forth.

Next, I’ll create a GitHub repository and start adding issues (basically feature requests). I’ve used the GitHub kanban board in the past and have also been a heavy Trello user at various times. However, these days I keep it simple and create GitHub issues until I feel I have enough to constitute an MVP. Then I’ll use the GitHub MCP server in conjunction with Claude Code or Cursor to pull and implement these issues.

Before committing resources to development, how do you approach initial validation to ensure the market opportunity exists for a new SaaS product?

The answer to this question is simple. I don’t. If the problem is sufficiently annoying that I eventually can’t resist building something to solve it, then that’s enough for me. That said, once I have an MVP, I’ll start telling everybody I know about it and really try to lower the barrier associated with getting started.

For instance, if someone expresses interest in using SecurityBot, I’ll proactively volunteer to help them validate their site via DNS. If someone wants to give ContributorIQ a try, I’ll ask to meet with the person running due diligence to ensure they can successfully connect to their GitHub organization. It’s in these early stages of customer acquisition that you can determine what users truly want rather than merely trying to replicate what competitors are doing.

Execution, Tools, and Code

When deciding to build a new SaaS product, what’s the most critical strategic question you seek to answer before writing any code?

Personally, the question I ask myself is whether I seriously believe I will use the product every day. If the answer is an adamant yes, then I proceed. If it’s anything but a “heck yes,” then I’ve learned that it’s best to sit on the idea for a few more weeks before investing any additional time.

Which tools do you recommend, and why?

I regularly use a number of different tools for building software, including Cursor and Claude Code for AI-assisted coding and development, Laravel Forge for deployment, Cloudflare and SecurityBot for security, and Google Analytics and Search Console for analytics. Check out my comprehensive list at the end of this article for more details.

How do you accurately measure the success and adoption of your product? What key metrics (KPIs) do you prioritize tracking immediately after launch?

Something I’ve learned the hard way is that being in such a hurry to launch a product means that you neglect to add an appropriate level of monitoring. I’m not necessarily referring to monitoring in the sense of Sentry or Datadog; rather I’m referring to simply knowing when somebody starts a trial.

At a minimum, you should add a restricted admin dashboard to your SaaS which displays various KPIs such as who started a trial and when. You should also be able to quickly determine when trialers reach a key milestone. For instance, at SecurityBot, that key milestone is connecting their Slack, because once that happens, trialers will periodically receive useful notifications right in the very place where they spend a large part of their day.
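A minimal sketch of that kind of restricted KPI endpoint, assuming a hypothetical Flask app and a `trials` table with a `slack_connected_at` column (illustrative only, not SecurityBot’s actual dashboard):

```python
# Sketch: a restricted admin endpoint reporting trial starts and a key
# activation milestone (e.g., connecting Slack).
# Assumed schema: trials(email, started_at, slack_connected_at)
import sqlite3
from flask import Flask, abort, request

app = Flask(__name__)
ADMIN_TOKEN = "change-me"  # placeholder; use real authentication in practice

@app.route("/admin/kpis")
def kpis():
    if request.headers.get("X-Admin-Token") != ADMIN_TOKEN:
        abort(403)
    db = sqlite3.connect("saas.db")
    started = db.execute("SELECT COUNT(*) FROM trials").fetchone()[0]
    activated = db.execute(
        "SELECT COUNT(*) FROM trials WHERE slack_connected_at IS NOT NULL"
    ).fetchone()[0]
    recent = db.execute(
        "SELECT email, started_at FROM trials ORDER BY started_at DESC LIMIT 10"
    ).fetchall()
    return {
        "trials_started": started,
        "slack_connected": activated,
        "recent_trials": [{"email": e, "started_at": s} for e, s in recent],
    }
```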

On build versus buy: What’s your critical decision framework for choosing to use prebuilt frameworks and third-party platforms?

I think it’s a tremendous mistake to try to reinvent the wheel. Frameworks and libraries such as Ruby on Rails, Laravel, Django, and others are what’s known as “batteries included,” meaning they provide 99% of what developers require to build a tremendously useful, scalable, and maintainable software product. If your intention is to build a successful SaaS product, then you should focus exclusively on building a quality product and acquiring customers, period. Anything else is just playing with computers. And there’s nothing wrong with playing with computers! It’s my favorite thing to do in the world. But it’s not the same thing as building a software business.

Quality and Security

What unique security and quality assurance (QA) protocols does an intelligent SaaS product require that a standard, non-AI application doesn’t?

The two most important are prompt management and output monitoring. To minimize response drift (the LLM’s tendency toward creative, inconsistent interpretation), you should tightly define the LLM prompt and test it repeatedly against diverse datasets to ensure consistent, desired behavior.
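One way to picture that kind of prompt regression test is a small script that replays a labeled dataset and fails on drift; `classify_ticket` and the dataset format below are hypothetical:

```python
# Sketch: replay a labeled dataset through a fixed prompt and fail
# if accuracy drifts below an agreed threshold.
import json

def classify_ticket(text: str) -> str:
    """Hypothetical wrapper around the prompt + model being tested.
    Replace with your real LLM invocation."""
    raise NotImplementedError

def evaluate_prompt(dataset_path: str, threshold: float = 0.95) -> float:
    with open(dataset_path) as f:
        cases = json.load(f)  # [{"input": ..., "expected": ...}, ...]
    correct = sum(
        1 for case in cases
        if classify_ticket(case["input"]) == case["expected"]
    )
    accuracy = correct / len(cases)
    assert accuracy >= threshold, (
        f"Prompt drift detected: accuracy {accuracy:.2%} below {threshold:.0%}"
    )
    return accuracy
```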

Developers should look beyond general OpenAI APIs and consider specialized custom models (like the 2.2 million available on Hugging Face) that are better suited for specific tasks.

To ensure quality and prevent harm, you’ll also need to proactively monitor and review the LLM’s output (particularly when it’s low-confidence or potentially sensitive) and continuously refine and tune the prompt. Keeping a human in the loop (HITL) is essential: At Nomorobo, for instance, we manually reviewed low-confidence robocall categorizations to improve the model. At Adalo, we’ve reviewed thousands of app-building prompt responses to ensure desired outcomes.
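A sketch of the low-confidence routing idea, with made-up thresholds and labels (not Nomorobo’s or Adalo’s actual pipeline):

```python
# Sketch: accept high-confidence LLM outputs automatically and queue
# low-confidence or sensitive ones for human review.
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    items: list = field(default_factory=list)

    def add(self, item):
        self.items.append(item)

REVIEW_THRESHOLD = 0.8                    # placeholder value
SENSITIVE_LABELS = {"billing", "legal"}   # placeholder categories

def route(output: dict, queue: ReviewQueue) -> str:
    """output = {"label": ..., "confidence": ..., "text": ...}"""
    if output["confidence"] < REVIEW_THRESHOLD or output["label"] in SENSITIVE_LABELS:
        queue.add(output)       # a human reviews it later
        return "needs_review"
    return "auto_accepted"      # ships without intervention
```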

Critically, businesses must transparently communicate to users exactly how their data and intellectual property are being used, particularly before passing it to a third-party LLM service.

It’s also important to differentiate when AI is truly necessary. Sometimes, AI can be used most effectively to enhance non-AI tools—for instance, using an LLM to generate complex, difficult-to-write scripts or reviewing schemas for database optimization—rather than trying to solve the core problem with a large, general model.

Marketing, Launch, and Business Success

What are your top two strategies for launching a product?

For early-stage growth, founders should focus intently on two core strategies: prioritizing SEO and proactively promoting the product.

I recommend prioritizing SEO early and aggressively. Currently, the majority of organic traffic still comes from traditional search results, not AI-generated answers. We are, however, seeing generative engine optimization (GEO) account for a growing share of visitors. So while you should focus on Google organic traffic, I also suggest spending time tuning your marketing pages for AI crawlers.

Implement a feature-to-landing page workflow: For SecurityBot, nearly all traffic was driven by creating a dedicated SEO-friendly landing page for every new feature. AI tools like Cursor can automate the creation of these pages, including generating necessary assets like screenshots and promotional tweets. Landing pages for features like Broken Link Checker and PageSpeed Insights were 100% created by Cursor and Sonnet 4.5.

Many technical founders hesitate to promote their work, but visibility is crucial. Overcome founder shyness: Be vocal about your product and get it out there. Share your product immediately with friends, colleagues, and former customers to start gaining early traction and feedback.

Mastering these two strategies is more than enough to keep your team busy and effectively drive initial growth.

On scaling: What’s the single biggest operational hurdle when trying to scale your business from a handful of users to a large, paying user base?

I’ve had the opportunity to see business scaling hurdles firsthand, not only at Xenon but also during the M&A process, as well as within my own projects. The biggest operational hurdle, by far, is maintaining focus on customer acquisition. It is so tempting to build “just one more feature” instead of creating another video or writing a blog post.

Conversely, for those companies that do reach a measure of product-market fit, my observation is that they tend to focus far too much on customer acquisition at the cost of customer retention. There’s a concept in subscription-based businesses known as “max MRR”: the point at which your business simply stops growing because the revenue lost to customer churn each month cancels out the revenue gained through customer acquisition. In short, at a certain point, you need to focus on both, and that’s difficult to do.
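The arithmetic behind “max MRR” is worth spelling out: growth stalls when the MRR lost to churn each month equals the new MRR added, so, roughly (assuming steady acquisition and a constant churn rate), max MRR equals new MRR per month divided by the monthly churn rate. A quick sketch:

```python
# Sketch: back-of-the-envelope "max MRR" estimate.
# Growth stops when churned MRR (mrr * churn_rate) equals new MRR added.
def max_mrr(new_mrr_per_month: float, monthly_churn_rate: float) -> float:
    return new_mrr_per_month / monthly_churn_rate

# Example: adding $2,000 of new MRR each month with 4% monthly churn
# caps out around $50,000 MRR unless acquisition or retention improves.
print(max_mrr(2_000, 0.04))  # 50000.0
```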

We’ll end with monetization. What’s the most successful and reliable monetization strategy you’ve seen for a new AI-powered SaaS feature? Is it usage-based, feature-gated, or a premium tier?

We’re certainly seeing usage-based monetization models take off these days, and I think for certain types of businesses, that makes a lot of sense. However, my advice to those trying to build a new SaaS business is to keep your subscription model as simple and understandable as possible in order to maximize customer acquisition opportunities.

Thanks, Jason.

For more from Jason Gilmore on developing successful SaaS products, join us on February 10 for our AI Superstream: Building SaaS Businesses with AI. Jason and a lineup of AI specialists from Dynatrace, Sendspark, DBGorilla, Changebot, and more will examine every phase of building with AI, from initial ideation and hands-on coding to launch, security, and marketing—and share case studies and hard-won insights from production. Register here; it’s free and open to all.

Appendix: Recommended Tools

| Category | Tool/service | Primary use | Notes |
|---|---|---|---|
| AI-assisted coding | Cursor (with Opus 4.5) and Claude Code | Coding and AI assistance | Claude Opus 4.5 highly valued |
| Code management | GitHub | Managing code repositories | Standard code management |
| Deployment | Laravel Forge | Deploying projects to DigitalOcean | Highly valued for simplifying deployment |
| API/SaaS interaction | MCP servers | Interacting with GitHub, Stripe, Chrome DevTools, and Trello | Centralized interaction point |
| Architecture | Mermaid | Creating architectural diagrams | Used for visualization |
| Research | ChatGPT | Rubber duck debugging and general AI assistance | Dedicated tool for problem-solving |
| Security | Cloudflare | Security services and blocking bad actors | Primarily focused on protection |
| Marketing and SEO | Google Search Console | Tracking marketing page performance | Focuses on search visibility |
| Analytics | Google Analytics 4 (GA4) | Site metrics and reporting | Considered a “horrible” but necessary tool due to lack of better alternatives |

Generative AI in the Real World: Aurimas Griciūnas on AI Teams and Reliable AI Systems

15 January 2026 at 11:55

SwirlAI founder Aurimas Griciūnas helps tech professionals transition into AI roles and works with organizations to create AI strategy and develop AI systems. Aurimas joins Ben to discuss the changes he’s seen over the past couple years with the rise of generative AI and where we’re headed with agents. Aurimas and Ben dive into some of the differences between ML-focused workloads and those implemented by AI engineers—particularly around LLMOps and agentic workflows—and explore some of the concerns animating agent systems and multi-agent systems. Along the way, they share some advice for keeping your talent pipeline moving and your skills sharp. Here’s a tip: Don’t dismiss junior engineers.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform or follow us on YouTube, Spotify, Apple, or wherever you get your podcasts.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.44
All right. So today for our first episode of this podcast in 2026, we have Aurimas Griciūnas of SwirlAI. And he was previously at Neptune.ai. Welcome to the podcast, Aurimas. 

01.02
Hi, Ben, and thank you for having me on the podcast. 

01.07
So actually, I want to start with a little bit of culture before we get into some technical things. I noticed now it seems like you’re back to teaching people some of the latest ML and AI stuff. Of course, before the advent of generative AI, the terms we were using were ML engineer, MLOps. . . Now it seems like it’s AI engineer and maybe LLMOps. I’m assuming you use this terminology in your teaching and consulting as well.

So in your mind, Aurimas, what are some of the biggest distinctions between that move from ML engineer to AI engineer, from MLOps to LLMOps? What are two to three of the biggest things that people should understand?

02.05
That’s a great question, and the answer depends on how you define AI engineering. I think how most of the people today define it is a discipline that builds systems on top of already existing large language models, maybe some fine-tuning, maybe some tinkering with the models. But it’s not about the model training. It’s about building systems or systems on top of the models that you already have.

So the distinction is quite big because we are no longer creating models. We are reusing models that we already have. And hence the discipline itself becomes a lot more similar to software engineering than actual machine learning engineering. So we are not training models. We are building on top of the models. But some of the similarities remain because both of the systems that we used to build as machine learning engineers and now we build as AI engineers are nondeterministic in their nature.

So some evaluation and practices of how we would evaluate these systems remain. In general, I would even go as far as to say that there are more differences than similarities in these two disciplines, and it’s really, really hard to single out three main ones. Right?

03.38
So I would say software engineering, right. . . 

03.42
So, I guess, based on your description there, the personas have changed as well.

So in the previous incarnation, you had ML teams, data science teams—they were mostly the ones responsible for doing a lot of the building of the models. Now, as you point out, at most people are doing some sort of posttraining from fine-tuning. Maybe the more advanced teams are doing some sort of RL, but that’s really limited, right?

So the persona has changed. But on the other hand, at some level, Aurimas, it’s still a model, so then you still need the data scientist to interpret some of the metrics and the evals, correct? In other words, if you run with completely just “Here’s a bunch of software engineers; they’ll do everything,” obviously you can do that, but is that something you recommend without having any ML expertise in the team? 

04.51
Yes and no. A year ago or two years ago, maybe one and a half years ago, I would say that machine learning engineers were still the best fit for AI engineering roles because they were used to dealing with nondeterministic systems.

They knew how to evaluate something that the output of which is a probabilistic function. So it is more of a mindset of working with these systems and the practices that come from actually building machine learning systems beforehand. That’s very, very useful for dealing with these systems.

05.33
But nowadays, I think already many people—many specialists, many software engineers—have already tried to upskill in this nondeterminism and learn quite a lot [about] how you would evaluate these kinds of systems. And the most valuable specialist nowadays, [the one who] can actually, I would say, bring the most value to the companies building these kinds of systems is someone who can actually build end-to-end, and so has all kinds of skills, starting from being able to figure out what kind of products to build and actually implementing some POC of that product, shipping it, exposing it to the users and being able to react [to] the feedback [from] the evals that they built out for the system. 

06.30
But the eval part can be learned. Right. So you should spend some time on it. But I wouldn’t say that you need a dedicated data scientist or machine learning engineer specifically dealing with evals anymore. Two years ago, probably yes. 

06.48
So based on what you’re seeing, people are beginning to organize accordingly. In other words, the recognition here is that if you’re going to build some of these modern AI systems or agentic systems, it’s really not about the model. It’s a systems and software engineering problem. So therefore we need people who are of that mindset. 

But on the other hand, it is still data. It’s still a data-oriented system, so you might still have pipelines, right? Data pipelines to data teams that data engineers typically maintain. . . And there’s always been this lamentation even before the rise of generative AI: “Hey, these data pipelines maintained by data engineers are great, but they don’t have the same software engineering rigor that, you know, the people building web applications are used to.” What’s your sense in terms of the rigor that these teams are bringing to the table in terms of software engineering practices? 

08.09
It depends on who is building the system. AI engineers [comprise an] extremely wide range. An engineer can be an AI engineer. A software engineer could be an AI engineer, and a machine learning engineer can be an AI engineer. . .

08.31 
Let me rephrase that, Aurimas. In your mind, [on] the best teams, what’s the typical staffing pattern? 

08.39
It depends on the size of the project. If it’s just a project that’s starting out, then I would say a full stack engineer can quickly actually start off a project, build A, B, or C, and continue expanding it. And then. . .

08.59
Mainly relying on some sort of API endpoint for the model?

09.04
Not necessarily. So it can be a Rest API-based system. It can be a stream processing-based system. It can be just a CLI script. I would never encourage [anyone] to build a system which is more complex than it needs to be, because very often when you have an idea, just to prove that it works, it’s enough to build out, you know, an Excel spreadsheet with a column of inputs and outputs and then just give the outputs to the stakeholder and see if it’s useful.

So it’s not always needed to start with a Rest API. But in general, when it comes to who should start it off, I think it’s people who are very generalist. Because at the very beginning, you need to understand end to end—from product to software engineering to maintaining those systems.

10.01
But once this system evolves in complexity, then very likely the next person you would be bringing on—again, depending on the product—very likely would be someone who is good at data engineering. Because as you mentioned before, most of the systems are relying on a very high, very strong integration of these already existing data systems [that] you’re building for an enterprise, for example. And that’s a hard thing to do right. And the data engineers do it quite [well]. So definitely a very useful person to have in the team. 

10.43
And maybe eventually, once those evals come into play, depending on the complexity of the product, the team might benefit from having an ML engineer or data scientist in between. But then this is more targeting those cases where the product is complex enough that you actually need LLMs as judges, and then you need to evaluate those LLM judges so that your evals are evaluated as well.

If you just need some simple evals—because some of them can be exact assertion-based evals—those can easily be done, I think, by someone who doesn’t have past machine learning experience.
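As a rough illustration of those exact, assertion-based evals, here is a hypothetical sketch with a placeholder `run_agent` function standing in for the system under test:

```python
# Sketch: simple assertion-based evals that need no ML background.
def run_agent(question: str) -> str:
    """Placeholder for the system under test; replace with the real call."""
    raise NotImplementedError

# Hypothetical expectations: the answer must contain an exact substring.
CASES = [
    ("What is the refund window?", "30 days"),
    ("Which plan includes SSO?", "Enterprise"),
]

def test_exact_expectations():
    for question, expected in CASES:
        answer = run_agent(question)
        assert expected in answer, (
            f"{question!r}: expected {expected!r} in {answer!r}"
        )
```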

11.36
Another cultural question I have is the following. I would say two years ago, 18 months ago, most of these AI projects were conducted. . . Basically, it was a little more decentralized, in other words. So here’s a group here. They’re going to do something. They’re going to build something on their own and then maybe try to deploy that. 

But now recently I’m hearing, Aurimas, and I don’t know if you are hearing the same thing, that, at least in some of these big companies, they’re starting to have much more of a centralized team that can help other teams.

So in other words, there’s a centralized team that somehow has the right experience and has built a few of these things. And then now they can kind of consolidate all those learnings and then help other teams. If I’m in one of these organizations, then I approach these experts. . . I guess in the old, old days—I hate this term—they would use some center of excellence kind of thing. So you will get some sort of playbook and they will help you get going. Sort of like in your previous incarnation at Neptune.ai. . . It’s almost like you had this centralized tool and experiment tracker where someone can go in and learn what others are doing and then learn from each other.

Is this something that you’re hearing that people are going for more of this kind of centralized approach? 

13.31
I do hear about these kinds of situations, but naturally, it’s always a big enterprise that’s managed to pull that off. And I believe that’s the right approach because that’s also what we have been doing before GenAI. We had those centers of excellence. . . 

13.52
I guess for our audience, explain why you think this is the right approach. 

13.58
So, two things why I think it is the right approach. The first thing is that we used to have these platform teams that would build out a shared pool of software that can be reused by other teams. So we kind of defined the standards of how these systems should be operated in production and in development. And they would decide what kind of technologies and tech stack should be used within the company. So I think it’s a good idea to not spread too widely in the tools that you’re using. 

Also, have template repositories that you can just pool and reuse. Because then not only is it easier to kick off and start your build out of the project, but it also helps control how well this knowledge can actually be centralized, because. . .

14.59
And also there’s security, then there’s governance as well. . . 

15.03
For example, yes. The platform side is one of those—just use the same stack and help others build it easier and faster. And the second piece is that obviously GenAI systems are still very young. So [it’s] very early and we really do not have, as some would say, enough reps in building these kinds of systems.

So we learn as we go. With regular machine learning, we already had everything figured out. We just needed some practice. Now, if we learn in this distributed way and then we do not centralize learnings, we suffer. So basically, that’s why you would have a central team that holds the knowledge. But then it should, you know, help other teams implement some new type of system and then bring those learnings back into the central core and then spread those learnings back to other teams.

But this is also how we used to operate in these platform teams in the old days, three years, four years ago. 

16.12
Right, right. But then, I guess, what happened with the release of generative AI is that the platform teams might have moved too slow for the rank and file. And so hence you started hearing about what they call shadow AI, where people would use tools that were not exactly blessed by the platform team. But now I think the platform teams are starting to arrest some of that. 

16.42
I wonder if it is platform teams who are kind of catching up, or is it the tools that [are] maturing and the practices that are maturing? I think we are getting more and more reps in building those systems, and now it’s easier to catch up with everything that’s going on. I would even go as far as to say it was impossible to be on top of it, and maybe it wouldn’t even make sense to have a central team.

17.10
A lot of these demos look impressive—generative AI demos, agents—but they fail when you deploy them in the wild. So in your mind, what is the single biggest hurdle or the most common reason why a lot of these demos or POCs fall short or become unreliable in production? 

17.39
That again, depends on where we are deploying the system. But one of the main reasons is that it is very easy to build a POC, and then it targets a very specific and narrow set of real-world scenarios. And we kind of believe that it solves [more than it does]. It just doesn’t generalize well to other types of scenarios. And that’s the biggest problem.

18.07
Of course there are security issues and all kinds of stability issues, even with the biggest labs and the biggest providers of LLMs, because those APIs are also not always stable, and you need to take care of that. But that’s an operational issue. I think the biggest issue is not operational. It’s actually evaluation-based, and sometimes even use case-based: Maybe the use case is not the correct one. 

18.36
You know, before the advent of generative AI, ML teams and data teams were just starting to get going on observability. And then obviously generative AI comes into the picture. So what changes as far as LLMs and generative AI when it comes to observability? 

19.00
I wouldn’t even call observability of regular machine learning systems and [of] AI systems the same thing.

Going back to a previous parallel, generative AI observability is a lot more similar to regular software observability. It’s all about tracing your application and then on top of those traces that you collect in the same way as you would collect from the regular software application, you add some additional metadata so that it is useful for performing evaluation actions on your agent AI type of system.

So I would even contrast machine learning observability with GenAI observability because I think these are two separate things.

19.56
Especially when it comes to agents and the agents that involve some sort of tool use, then you’re really getting into kind of software traces and software observability at that point. 

20.13
Exactly. Tool use is just a function call. A function call is just a regular software span, let’s say. Now what’s important for GenAI is that you also know why that tool was selected to be used. And that’s where you trace outputs of your LLMs. And you know why that LLM call, that generation, has decided to use this and not the other tool.

So things like prompts, token counts, and how much time to first token it took for which generation, these kinds of things are what is additional to be traced compared to regular software tracing. 
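A minimal sketch of attaching that kind of GenAI metadata to an ordinary span with the OpenTelemetry Python API; the attribute names are illustrative rather than a fixed standard, and `fake_model` stands in for a real LLM call:

```python
# Sketch: wrap an LLM call in a span and record prompt/token metadata.
# Assumes the opentelemetry-api package; without an SDK configured, the
# tracer is a no-op, so this runs harmlessly.
import time
from opentelemetry import trace

tracer = trace.get_tracer("genai-demo")

def fake_model(prompt: str) -> str:
    """Placeholder for the real model call."""
    return "placeholder response"

def call_llm(prompt: str) -> str:
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.prompt", prompt)
        start = time.time()
        response = fake_model(prompt)
        # Approximate: without streaming this is really time to full response.
        span.set_attribute("llm.time_to_first_token_s", time.time() - start)
        span.set_attribute("llm.output_tokens", len(response.split()))
        return response
```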

20.58
And then, obviously, there’s also. . . I guess one of the main changes probably this year will be multimodality, if there’s different types of modes and data involved.

21.17
Right. For some reason I didn’t touch upon that, but you’re right. There’s a lot of difference here because inputs and outputs, it’s hard. First of all, it’s hard to trace these kinds of things like, let’s say, audio input and output [or] video images. But I think [an] even harder kind of problem with this is how do you make sure that the data that you trace is useful?

Because those observability systems that are being built out, like LangSmith, Langfuse, and all of others, you know, how do you make it so that it’s convenient to actually look at the data that you trace, which is not text and not regular software spans? How [do] you build, [or] even correlate, two different audio inputs to each other? How do you do that? I don’t think that problem is solved yet. And I don’t even think that we know what we want to see when it comes to comparing this kind of data next to each other. 

22.30
So let’s talk about agents. A friend of mine actually asked me yesterday, “So, Ben, are agents real, especially on the consumer side?” And my friend was saying he doesn’t think it’s real. So I said, actually, it’s more real than people think in the following sense: First of all, deep research, that’s agents. 

And then secondly, people might be using applications that involve agents, but they don’t know it. So, for example, they’re interacting with the system and that system involves some sort of data pipeline that was written and is being monitored and maintained by an agent. Sure, the actual application is not an agent. But underneath there are agents involved in the application.

So to that extent, I think agents are definitely real in the data engineering and software engineering space. But I think there may be more consumer apps with agents involved underneath that consumers don’t know about. What’s your sense? 

23.41
Quite similar. I don’t think there are real, full-fledged agents that are exposed. 

23.44
I think when people think of agents, they think of interacting with the agent directly. And that may not be the case yet. 

24.04
Right. So then, it depends on how you define the agent. Is it a fully autonomous agent? What is an agent to you? So, GenAI in general is very useful on many occasions. It doesn’t necessarily need to be a tool-using self-autonomous agent.

24.21
So like I said, the canonical example for consumers would be deep research. Those are agents.

24.27
Those are agents, that’s for sure. 

24.30
If you think of that example, it’s a bunch of agents searching across different data collections, and then maybe a central agent unifying and presenting it to the user in a coherent way.

So from that perspective, there probably are agents powering consumer apps. But they may not be the actual interface of the consumer app. So the actual interface might still be rule-based or something. 

25.07
True. Like data processing. Some automation is happening in the background. And a deep research agent, that is exposed to the user. Now that’s relatively easy to build because you don’t need to very strongly evaluate this kind of system. Because you expect the user to eventually evaluate the results. 

25.39
Or in the case of Google, you can present both: They have the AI summary, and then they still have the search results. And then based on the user signals of what the user is actually consuming, then they can continue to improve their deep research agent. 

25.59
So let’s say the disasters that can happen from wrong results were not that bad. Right? So. 

26.06
Oh, no, it can be bad if you deploy it inside the enterprise, and you’re using it to prepare your CFO for some earnings call, right?

26.17
True, true. But then you know whose responsibility is it? The agent’s, that provided 100%…? 

26.24
You can argue that’s still an agent, but then the finance team will take those results and scrutinize [them] and make sure they’re correct. But an agent prepared the initial version. 

26.39
Exactly, exactly. So it still needs review.

26.42
Yeah. So the reason I bring up agents is, do agents change anything from your perspective in terms of eval, observability, and anything else? 

26.55
They do a little bit. Compared to agent workflows that are not full agents, the only change that really happens. . . And we are talking now about multi-agent systems, where multiple agents can be chained or looped in together. So really the only difference there is that the length of the trace is not deterministic. And the amount of spans is not deterministic. So in the sense of observability itself, the difference is minimal as long as those agents and multi-agent systems are running in a single runtime.

27.44
Now, when it comes to evals and evaluation, it is different because you evaluate different aspects of the system. You try to discover different patterns of failures. As an example, if you’re just running your agent workflow, then you know what kind of steps can be taken, and then you can be almost 100% sure that the entire path from your initial intent to the final answer is completed. 

Now with agent systems and multi-agent systems, you can still achieve, let’s say, input-output. But then what happens in the middle is not a black box, but it is very nondeterministic. Your agents can start looping the same questions between each other. So you need to also look for failure signals that are not present in agentic workflows, like too many back-and-forth [responses] between the agents, which wouldn’t happen in a regular agentic workflow.

Also, for tool use and planning, you need to figure out if the tools are being executed in the correct order. And similar things. 
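One way to picture those checks over a collected trace; the span format, thresholds, and expected tool order below are assumptions for illustration:

```python
# Sketch: scan a multi-agent trace for two failure signals mentioned above:
# excessive agent-to-agent back-and-forth, and tools called out of plan.
MAX_HANDOFFS = 6                                    # placeholder threshold
EXPECTED_TOOL_ORDER = ["search", "summarize", "draft"]  # placeholder plan

def check_trace(spans: list[dict]) -> list[str]:
    problems = []
    handoffs = sum(1 for s in spans if s["kind"] == "agent_message")
    if handoffs > MAX_HANDOFFS:
        problems.append(f"{handoffs} agent-to-agent messages (possible loop)")
    tools = [s["name"] for s in spans if s["kind"] == "tool_call"]
    expected = [t for t in EXPECTED_TOOL_ORDER if t in tools]
    if tools != expected:
        # Flags out-of-order, repeated, or unplanned tool calls.
        problems.append(f"tool order {tools} differs from plan {expected}")
    return problems
```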

29.09
And that’s why I think in that scenario, you definitely need to collect fine-grained traces, because there’s also the communication between the agents. One agent might be lying to another agent about the status of completion and so on and so forth. So you need to really kind of have granular level traces at that point. Right? 

29.37
I would even say that you always need to capture the lower-level pieces. Even if you’re running a simple RAG [retrieval-augmented generation] system, you still need those granular traces for each of the actions.

29.52
But definitely, interagent communication introduces more points of failure that you really need to make sure that you also capture. 

So in closing, I guess, this is a fast-moving field, right? So there’s the challenge for you, the individual, for your professional development. But then there’s also the challenge for you as an AI team in how you keep up. So any tips at both the individual level and at the team level, besides going to SwirlAI and taking courses? [laughs] What other practical tips would you give an individual in the team? 

30.47
So for individuals, for sure, learn fundamentals. Don’t rely on frameworks alone. Understand how everything is really working under the hood; understand how those systems are actually connected.

Just think about how those prompts and context [are] actually glued together and passed from an agent to an agent. Do not think that you will be able to just mount a framework right on top of your system, write [a] few prompts, and everything will magically work. You need to understand how the system works from the first principles.

So yeah. Go deep. That’s for individual practitioners. 

31.32
When it comes to teams, well, that’s a very good question and a very hard question. Because, you know, in the upcoming one or two years, everything can change so much. 

31.44
And then one of the challenges, Aurimas, for example, in the data engineering space. . . It used to be, several years ago, I have a new data engineer in the team. I have them build some basic pipelines. Then they get confident, [and] then they build more complex pipelines and so on and so forth. And then that’s how you get them up to speed and get them more experience.

But the challenge now is that a lot of those basic pipelines can be built with agents, so the entry-level work that used to be where you trained your entry-level people is disappearing, which also impacts your talent pipeline. If you don’t have people at the beginning, then you won’t have experienced people later on.

So any tips for teams and the challenge of the pipeline for talent?

32.56
That’s such a hard question. I would like to say, do not dismiss junior engineers. Train them. . .

33.09
Oh, yeah, I agree completely. I agree completely.

33.14
But that’s a hard decision to make, right? Because you need to be thinking about the future.

33.26
I think, Aurimas, the mindset people have to [have is to] say, okay, so the traditional training grounds we had, in this example of the data engineer, were these basic pipelines. Those are gone. Well, then we find a different way for them to enter. It might be they start managing some agents instead of building pipelines from scratch. 

33.56
We’ll see. We’ll see. But we don’t know. 

33.58
Yeah. Yeah. We don’t know. The agents even in the data engineering space are still human-in-the-loop. So in other words a human still needs to monitor [them] and make sure they’re working. So that could be the entry-level for junior data engineers. Right? 

34.13
Right. But you know that’s the hard part about this question. The answer is, that could be, but we do not know, and for now maybe it doesn’t make sense. . .

34.28
My point is that if you stop hiring these juniors, I think that’s going to hurt you down the road. So you hire the junior and then stick them in a different track, and then, as you say, things might change, but then they can adapt. If you hire the right people, they will be able to adapt. 

34.50
I agree, I agree, but then, there are also people who are potentially not right for that role, let’s say, and you know, what I. . . 

35.00
But that’s true even when you hired them and you assigned them to build pipelines. So same thing, right? 

35.08
The same thing. But the thing I see with the juniors and less senior people who are currently building is that we are relying too much on vibe coding. I would also suggest looking for ways to onboard someone new and make sure that the person actually learns the craft and doesn’t just come in and vibe code his or her way around, creating more issues for senior engineers than actually helping. 

35.50
Yeah, this is a big topic, but one of the challenges, all I can say is that, you know, the AI tools are getting better at coding at some level because the people building these models are using reinforcement learning and the signal in reinforcement learning is “Does the code run?” So then what people are ending up with now with this newer generation of these models is [that] they vibe code and they will get code that runs because that’s what the reinforcement learning is optimizing for.

But that doesn’t mean the code doesn’t introduce problems. On the face of it, though, it’s running, right? An experienced person obviously can probably handle that. 

But anyway, so last word, you get the last word, but take us on a positive note. 

36.53
[laughs] I do believe that the future is bright. It’s not grim, not dark. I am very excited about what is happening in the AI space. I do believe that it will not be as fast. . . All this AGI and AI taking over human jobs, it will not happen as fast as everyone is saying. So you shouldn’t be worried about that, especially when it comes to enterprises. 

I believe that we already had [very powerful] technology one or one and a half years ago. [But] it will still take enterprises another five years or so to actually get the most out of the technology we already have. So there will be enough work and jobs for at least the upcoming 10 years. And I think people should not be worried too much about it.

38.06
But in general, eventually, even the ones who will lose their jobs will probably respecialize in that long period of time to some more valuable role. 

38.18
I guess I will close with the following advice: The main thing that you can do is just keep using these tools and keep learning. I think the distinction will be increasingly between those who know how to use these tools well and those who do not.

And with that, thank you, Aurimas.


The Problem with AI “Artists”

14 January 2026 at 07:07

A performance reel. Instagram, TikTok, and Facebook accounts. A separate contact email for enquiries. All staples of an actor’s website.

Except these all belong to Tilly Norwood, an AI “actor.”

This creation exemplifies one of the newer AI trends: AI “artists” that eerily resemble real humans (which, according to their creators, is the goal). Eline Van der Velden, the creator of Tilly Norwood, has said that she is focused on making the creation “a big star” in the “AI genre,” a distinction that has been used to justify AI-created artists as not taking jobs away from real actors. Van der Velden has explicitly said that Tilly Norwood was made to be photorealistic to provoke a reaction, and it’s working: talent agencies are reportedly looking to represent it.

And it’s not just Hollywood. Major producer Timbaland has created his own AI entertainment company and launched his first “artist,” TaTa, whose music is created by uploading his own demos to the platform Suno, reworking them with AI, and adding lyrics afterward.

But however technologically impressive, AI “artists” risk devaluing creativity as a fundamentally human act and, in the process, dehumanizing and “slopifying” creative labor.

Heightening Industry at the Expense of Creativity

The generative AI boom is deeply tied to creative industries, with profit-hungry machines monetizing every movie, song, and TV show as much as they possibly can. This, of course, predates AI “artists,” but AI is making the agenda even clearer. One of the motivations behind the Writers Guild strike of 2023 was countering the threat of studios replacing writers with AI.

For industry power players, employing AI “artists” means less reliance on human labor—cutting costs and making it possible to churn out products at a much higher rate. And in an industry already known for poor working conditions, there’s significant appeal in dealing with a creation they do not “need” to treat humanely.

Technological innovation has always risked eliminating certain jobs, but AI “artists” are a whole new monster. It isn’t just about speeding up processes or certain tasks but about excising human labor from the product entirely. In an industry where it’s already notoriously hard to make money as a creative, demand for human work will become even scarcer. And that’s not even considering the consequences for the art itself.

The AI “Slop” Takeover

Prioritizing money over quality has always prevailed in the industry; Netflix and Hallmark aren’t making all those Christmas romantic comedies with the same plot because they’re original stories, nor are studios embracing endless reboots and remakes of successful art because it would be visionary to remake a ’90s movie with a 20-something Hollywood star. But these productions still have their audiences, and in the end, they still require creative output and labor to be made.

Now, imagine that instead of these rom-coms cluttering Netflix, we have AI-generated movies and TV shows, starring creations like Tilly Norwood, and the soundtrack comes from a voice, lyrics, and production that was generated by AI.

The whole model of generative AI depends on regurgitating and recycling existing data. Admittedly, it’s a technological feat that Suno can generate a song and Sora can convert text to video; what it is NOT is a creative renaissance. AI-generated writing is already taking over, from essays in the classroom to motivational LinkedIn posts, and in addition to ruining the em dash, it consistently puts out material of low, robotic quality. AI “artists” that “sing” and “act” are the next uncanny destroyers of quality, and they will likely alienate audiences, who turn to art to feel connection.

Art has a long tradition of being used as resistance and a way of challenging the status quo; protest music has been a staple of culture—look no further than civil rights and antiwar movements in the United States in the 1960s. It is so powerful that there are attempts by political actors to suppress it and punish artists. Iranian filmmaker Jafar Panahi, who won the Palme d’Or at the Cannes Film Festival for It Was Just an Accident, was sentenced to prison in absentia in Iran for making the film, and this is not the first punishment he has received for his films. Will studios like Sony or Warner Bros. release songs or movies like these if they can just order marketing-compliant content from a bot?

A sign during the writers’ strike famously said “ChatGPT doesn’t have childhood trauma.” An AI “artist” may be able to carry out a creator’s agenda to a limited extent, but what value does that have coming from a generated creation with no lived experiences or emotions—especially when those are what drive people to make art in the first place?

To top it off, generative AI is not a neutral entity by any means; we’re in for a lot of stereotypical and harmful material, especially without the input of real artists. The fact that most AI “artists” are portrayed as young women with specific physical features is not a coincidence. It’s an intensification of the longstanding trend of making virtual assistants—from ELIZA to Siri to Alexa to AI “artists” like Tilly Norwood or Timbaland’s TaTa—“female,” reinforcing the trope of relegating women to “helper” roles designed to cater to the needs of the user, a clear manifestation of human biases.

Privacy and Plagiarism

Ensuring that “actors” and “singers” look and sound as human as possible in films, commercials, and songs requires that they be trained on real-world data. Tilly Norwood creator Van der Velden has defended herself by claiming that she used only licensed data and went through an extensive research process, looking at thousands of images for her creation. But “licensed data” does not automatically make taking the data ethical; look at Reddit, which signed a multimillion-dollar contract allowing Google to train its AI models on Reddit data. The vast data of Reddit users is not protected, just monetized by the organization.

AI expert Ed Newton-Rex has discussed how generative AI consistently steals from artists and has proposed measures to ensure that models are trained only on licensed or public domain data. There are ways for individual artists to protect their online work: including watermarks, opting out of data collection, and taking measures to block AI bots. While these strategies can keep data more secure, considering how vast generative AI is, they’re probably more a safeguard than a solution.

Jennifer King of Stanford’s Institute for Human-Centered Artificial Intelligence has suggested ways to protect data and personal information more generally, such as making “opt out” the default for data sharing and passing legislation that focuses not just on transparency of AI use but on its regulation—likely an uphill battle with the Trump administration trying to take away state AI regulations.

This is the ethical home that AI “artists” are living in. Think of all the faces of real people that went into making Tilly Norwood. A company may have licensed that data for use, but the artists whose “data” is their likeness and creativity likely didn’t (at least directly). In this light, AI “artists” are a form of plagiarism.

Undermining Creativity as Fundamentally Human

Looking at how art was transformed by technology before generative AI, it could be argued that this is simply the next step in a process of change rather than something to be concerned about. But photography, animation, typewriters, and all the other inventions used to justify the onslaught of AI “artists” did not eliminate human creativity. Photography was not a replacement for painting but a new art form, even if it did worry painters. There’s a difference between having a new, experimental way of doing something and extensively using data (particularly data taken without consent) to make creations that blur the lines of what is and isn’t human. For instance, Rebecca Xu, a professor of computer art and animation at Syracuse who teaches an “AI in Creative Practice” course, argues that artists can incorporate AI into their creative process. But as she warns, “AI offers useful tools, but you still need to produce your own original work instead of using something generated by AI.”

It’s hard to understand exactly how AI “artists” benefit human creativity, which is a fundamental part of our expression and intellectual development. Just look at the cave art from the Paleolithic era. Even humans 30,000 years ago who didn’t have secure food and shelter were making art. Unlike other industries, art did not come into existence purely for profit.

The arts are already undervalued economically, as is evident from the lack of funding in schools. Today, a kid who may want to be a writer will likely be bombarded with marketing from generative AI platforms like ChatGPT to use these tools to “write” a story. The result may resemble a narrative, but there’s not necessarily any creativity or emotional depth that comes from being human, and more importantly, the kid didn’t actually write. Still, the very fact that this AI-generated story is now possible curbs the industrial need for human artists.

How Do We Move Forward?

Though profit-hungry power players may be embracing AI “artists,” the same cannot be said for public opinion. The vast majority of artists and audiences alike are not interested in AI-generated art, much less AI “artists.” The power of public opinion shouldn’t be underestimated; the writers’ strike is probably the best example of that.

Collective mobilization will thus likely be key to challenging AI “artists” and the interests of studios, record labels, and other members of the creative industry’s ruling class. There have been wins already, such as the Writers Guild of America strike in 2023, which resulted in a contract stipulating that studios can’t credit AI as a writer. And because music, film, and television are full of stars, often with financial and cultural power, the resistance being voiced in the media could benefit from more actionable steps; for example, a prominent production company run by an A-list actor could pledge not to use any AI-generated “artists” in its work.

Beyond industry and labor, pushing back on the idea that art is unimportant unless you’re a “star” can also play a significant role in changing the conversation. This means funding art programs in schools and libraries so that young people know that art is something they can do, something that is fun and that brings joy—not necessarily to make money or a living but to express themselves and engage with the world.

The fundamental risk of AI “artists” is that they will become so commonplace that it will feel pointless to pursue art, and that much of the art we consume will lose its fundamentally human qualities. But human-made art and human artists will never become obsolete—that would require eliminating the human impulse to create. The challenge is making sure that artistic creation is not relegated to the margins of life.

GPUs: Enterprise AI’s New Architectural Control Point

13 January 2026 at 11:34

Over the past two years, enterprises have moved rapidly to integrate large language models into core products and internal workflows. What began as experimentation has evolved into production systems that support customer interactions, decision-making, and operational automation.

As these systems scale, a structural shift is becoming apparent. The limiting factor is no longer model capability or prompt design but infrastructure. In particular, GPUs have emerged as a defining constraint that shapes how enterprise AI systems must be designed, operated, and governed.

This represents a departure from the assumptions that guided cloud native architectures over the past decade: Compute was treated as elastic, capacity could be provisioned on demand, and architectural complexity was largely decoupled from hardware availability. GPU-bound AI systems don’t behave this way. Scarcity, cost volatility, and scheduling constraints propagate upward, influencing system behavior at every layer.

As a result, architectural decisions that once seemed secondary—how much context to include, how deeply to reason, and how consistently results must be reproduced—are now tightly coupled to physical infrastructure limits. These constraints affect not only performance and cost but also reliability, auditability, and trust.

Understanding GPUs as an architectural control point rather than a background accelerator is becoming essential for building enterprise AI systems that can operate predictably at scale.

The Hidden Constraints of GPU-Bound AI Systems

GPUs break the assumption of elastic compute

Traditional enterprise systems scale by adding CPUs and relying on elastic, on-demand compute capacity. GPUs introduce a fundamentally different set of constraints: limited supply, high acquisition costs, and long provisioning timelines. Even large enterprises increasingly encounter situations where GPU-accelerated capacity must be reserved in advance or planned explicitly rather than assumed to be instantly available under load.

This scarcity places a hard ceiling on how much inference, embedding, and retrieval work an organization can perform—regardless of demand. Unlike CPU-centric workloads, GPU-bound systems cannot rely on elasticity to absorb variability or defer capacity decisions until later. Consequently, GPU-bound inference pipelines impose capacity limits that must be addressed through deliberate architectural and optimization choices. Decisions about how much work is performed per request, how pipelines are structured, and which stages justify GPU execution are no longer implementation details that can be hidden behind autoscaling. They’re first-order concerns.

Why GPU efficiency gains don’t translate into lower production costs

While GPUs continue to improve in raw performance, enterprise AI workloads are growing faster than efficiency gains. Production systems increasingly rely on layered inference pipelines that include preprocessing, representation generation, multistage reasoning, ranking, and postprocessing.

Each additional stage introduces incremental GPU consumption, and these costs compound as systems scale. What appears efficient when measured in isolation often becomes expensive once deployed across thousands or millions of requests.

In practice, teams frequently discover that real-world AI pipelines consume materially more GPU capacity than early estimates anticipated. As workloads stabilize and usage patterns become clearer, the effective cost per request rises—not because individual models become less efficient but because GPU utilization accumulates across pipeline stages. GPU capacity thus becomes a primary architectural constraint rather than an operational tuning problem.

When AI systems become GPU-bound, infrastructure constraints extend beyond performance and cost into reliability and governance. As AI workloads expand, many enterprises encounter growing infrastructure spending pressures and increased difficulty forecasting long-term budgets. These concerns are now surfacing publicly at the executive level: Microsoft AI CEO Mustafa Suleyman has warned that remaining competitive in AI could require investments in the hundreds of billions of dollars over the next decade. The energy demands of AI data centers are also increasing rapidly, with electricity use expected to rise sharply as deployments scale. In regulated environments, these pressures directly impact predictable latency guarantees, service-level enforcement, and deterministic auditability.

In this sense, GPU constraints directly influence governance outcomes.

When GPU Limits Surface in Production

Consider a platform team building an internal AI assistant to support operations and compliance workflows. The initial design was straightforward: retrieve relevant policy documents, run a large language model to reason over them, and produce a traceable explanation for each recommendation. Early prototypes worked well. Latency was acceptable, costs were manageable, and the system handled a modest number of daily requests without issue.

As usage grew, the team incrementally expanded the pipeline. They added reranking to improve retrieval quality, tool calls to fetch live data, and a second reasoning pass to validate answers before returning them to users. Each change improved quality in isolation. But each also added another GPU-backed inference step.

Within a few months, the assistant’s architecture had evolved into a multistage pipeline: embedding generation, retrieval, reranking, first-pass reasoning, tool-augmented enrichment, and final synthesis. Under peak load, latency spiked unpredictably. Requests that once completed in under a second now took several seconds—or timed out entirely. GPU utilization hovered near saturation even though overall request volume was well below initial capacity projections.

The team initially treated this as a scaling problem. They added more GPUs, adjusted batch sizes, and experimented with scheduling. Costs climbed rapidly, but behavior remained erratic. The real issue was not throughput alone—it was amplification. Each user query triggered multiple dependent GPU calls, and small increases in reasoning depth translated into disproportionate increases in GPU consumption.

Eventually, the team was forced to make architectural trade-offs that had not been part of the original design. Certain reasoning paths were capped. Context freshness was selectively reduced for lower-risk workflows. Deterministic checks were routed to smaller, faster models, reserving the larger model only for exceptional cases. What began as an optimization exercise became a redesign driven entirely by GPU constraints.

The system still worked—but its final shape was dictated less by model capability than by the physical and economic limits of inference infrastructure.

This pattern—GPU amplification—is increasingly common in GPU-bound AI systems. As teams incrementally add retrieval stages, tool calls, and validation passes to improve quality, each request triggers a growing number of dependent GPU operations. Small increases in reasoning depth compound across the pipeline, pushing utilization toward saturation long before request volumes reach expected limits. The result is not a simple scaling problem but an architectural amplification effect in which cost and latency grow faster than throughput.
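
As a rough back-of-the-envelope sketch of this amplification effect, the Python snippet below multiplies per-request GPU calls by load; the stage names, timings, and request rates are purely illustrative assumptions, not figures from the case above.

```python
# Illustrative sketch: how per-request GPU calls compound as stages are added.
STAGES = {                      # hypothetical GPU-backed calls per user request
    "embedding": 1,
    "retrieval_rerank": 1,
    "first_pass_reasoning": 1,
    "tool_enrichment": 2,       # e.g., two tool calls, each followed by inference
    "validation_pass": 1,
    "final_synthesis": 1,
}

GPU_SECONDS_PER_CALL = 0.4      # assumed average GPU time per inference call
REQUESTS_PER_MINUTE = 300       # assumed sustained load

amplification = sum(STAGES.values())              # GPU calls per user request
gpu_seconds_per_min = amplification * GPU_SECONDS_PER_CALL * REQUESTS_PER_MINUTE

print(f"Amplification factor: {amplification} GPU calls per request")
print(f"GPU demand: {gpu_seconds_per_min:.0f} GPU-seconds per minute "
      f"({gpu_seconds_per_min / 60:.1f} GPUs at full utilization)")
```

Even with these modest assumptions, seven dependent calls per request translate into roughly 14 fully utilized GPUs at 300 requests per minute, which is why adding "just one more" validation pass can push a cluster toward saturation.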

Reliability Failure Modes in Production AI Systems

Many enterprise AI systems are designed with the expectation that access to external knowledge and multistage inference will improve accuracy and robustness. In practice, these designs introduce reliability risks that tend to surface only after systems reach sustained production usage.

Several failure modes appear repeatedly across large-scale deployments.

Temporal drift in knowledge and context

Enterprise knowledge is not static. Policies change, workflows evolve, and documentation ages. Most AI systems refresh external representations on a scheduled basis rather than continuously, creating an inevitable gap between current reality and what the system reasons over.

Because model outputs remain fluent and confident, this drift is difficult to detect. Errors often emerge downstream in decision-making, compliance checks, or customer-facing interactions, long after the original response was generated.

Pipeline amplification under GPU constraints

Production AI queries rarely correspond to a single inference call. They typically pass through layered pipelines involving embedding generation, ranking, multistep reasoning, and postprocessing, each stage consuming additional GPU resources. Systems research on transformer inference highlights how compute and memory trade-offs shape practical deployment decisions for large models. In production systems, these constraints are often compounded by layered inference pipelines—where additional stages amplify cost and latency as systems scale.

As systems scale, this amplification effect turns pipeline depth into a dominant cost and latency factor. What appears efficient during development can become prohibitively expensive when multiplied across real-world traffic.

Limited observability and auditability

Many AI pipelines provide only coarse visibility into how responses are produced. It’s often difficult to determine which data influenced a result, which version of an external representation was used, or how intermediate decisions shaped the final output.

In regulated environments, this lack of observability undermines trust. Without clear lineage from input to output, reproducibility and auditability become operational challenges rather than design guarantees.

Inconsistent behavior over time

Identical queries issued at different points in time can yield materially different results. Changes in underlying data, representation updates, or model versions introduce variability that’s difficult to reason about or control.

For exploratory use cases, this variability may be acceptable. For decision-support and operational workflows, temporal inconsistency erodes confidence and limits adoption.

Why GPUs Are Becoming the Control Point

Three trends converge to elevate GPUs from infrastructure detail to architectural control point.

GPUs determine context freshness. Storage is inexpensive, but embedding isn’t. Maintaining fresh vector representations of large knowledge bases requires continuous GPU investment. As a result, enterprises are forced to prioritize which knowledge remains current. Context freshness becomes a budgeting decision.

GPUs constrain reasoning depth. Advanced reasoning patterns—multistep analysis, tool-augmented workflows, or agentic systems—multiply inference calls. GPU limits therefore cap not only throughput but also the complexity of reasoning an enterprise can afford.

GPUs influence model strategy. As GPU costs rise, many organizations are reevaluating their reliance on large models. Small language models (SLMs) offer predictable latency, lower operational costs, and greater control, particularly for deterministic workflows. This has led to hybrid architectures in which SLMs handle structured, governed tasks, with larger models reserved for exceptional or exploratory scenarios.

What Architects Should Do

Recognizing GPUs as an architectural control point requires a shift in how enterprise AI systems are designed and evaluated. The goal isn’t to eliminate GPU constraints; it’s to design systems that make those constraints explicit and manageable.

Several design principles emerge repeatedly in production systems that scale successfully:

Treat context freshness as a budgeted resource. Not all knowledge needs to remain equally fresh. Continuous reembedding of large knowledge bases is expensive and often unnecessary. Architects should explicitly decide which data must be kept current in near real time, which can tolerate staleness, and which should be retrieved or computed on demand. Context freshness becomes a cost and reliability decision, not an implementation detail.
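
As a rough illustration of freshness budgeting, the sketch below assigns hypothetical knowledge sources to refresh tiers and schedules reembedding only within a fixed GPU budget; every name and number is an assumption.

```python
# Illustrative sketch: context freshness as an explicit GPU budget.
from dataclasses import dataclass

@dataclass
class KnowledgeSource:
    name: str
    tier: str                    # "realtime", "daily", or "on_demand"
    reembed_cost_gpu_s: float    # estimated GPU-seconds to refresh embeddings

SOURCES = [
    KnowledgeSource("trading-policies", "realtime", 120.0),
    KnowledgeSource("hr-handbook", "daily", 45.0),
    KnowledgeSource("archived-runbooks", "on_demand", 300.0),
]

DAILY_REEMBED_BUDGET_GPU_S = 200.0   # assumed daily GPU budget for freshness

def plan_refresh(sources, budget):
    """Refresh realtime sources first, then daily ones, until the budget runs out."""
    plan, spent = [], 0.0
    for tier in ("realtime", "daily"):
        for s in sources:
            if s.tier == tier and spent + s.reembed_cost_gpu_s <= budget:
                plan.append(s.name)
                spent += s.reembed_cost_gpu_s
    return plan, spent

print(plan_refresh(SOURCES, DAILY_REEMBED_BUDGET_GPU_S))
# (['trading-policies', 'hr-handbook'], 165.0) -- archived content waits for on-demand retrieval
```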

Cap reasoning depth deliberately. Multistep reasoning, tool calls, and agentic workflows quickly multiply GPU consumption. Rather than allowing pipelines to grow organically, architects should impose explicit limits on reasoning depth under production service-level objectives. Complex reasoning paths can be reserved for exceptional or offline workflows, while fast paths handle the majority of requests predictably.
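
One way such a cap can look in practice is sketched below, assuming a hypothetical call_model function (returning a dict with "text" and "confident" fields) and an offline escalation hook; the specific limit is illustrative.

```python
# Illustrative sketch: a hard cap on online reasoning depth under an SLO.
MAX_ONLINE_REASONING_STEPS = 2   # assumed cap for latency-sensitive requests

def answer(query, call_model, escalate_offline):
    """Run at most MAX_ONLINE_REASONING_STEPS GPU-backed passes, then defer."""
    result = call_model("draft", query)                # pass 1: first-pass reasoning
    for _ in range(MAX_ONLINE_REASONING_STEPS - 1):    # bounded refinement passes
        result = call_model("refine", result["text"])
        if result["confident"]:
            return result["text"]
    # Depth budget exhausted: stop burning GPU calls inline and defer deeper work.
    escalate_offline(query, result["text"])
    return result["text"]                              # best effort within the SLO
```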

Separate deterministic paths from exploratory ones. Many enterprise workflows require consistency more than creativity. Smaller, task-specific models can handle deterministic checks, classification, and validation with predictable latency and cost. Larger models should be used selectively, where ambiguity or exploration justifies their overhead. Hybrid model strategies are often more governable than uniform reliance on large models.
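
A minimal routing sketch, with hypothetical model clients and task names, might look like this:

```python
# Illustrative sketch: route deterministic work to a small model by default.
DETERMINISTIC_TASKS = {"classify_ticket", "validate_policy_id", "extract_fields"}

def route(task, payload, small_model, large_model):
    """Small model for structured, governed tasks; large model only when needed."""
    if task in DETERMINISTIC_TASKS:
        return small_model(task, payload)      # predictable latency and cost
    return large_model(task, payload)          # ambiguity justifies the overhead
```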

Measure pipeline amplification, not just token counts. Traditional metrics such as tokens per request obscure the true cost of production AI systems. Architects should track how many GPU-backed operations a single user request triggers end to end. This amplification factor often explains why systems behave well in testing but degrade under sustained load.
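
One lightweight way to capture this amplification factor is sketched below; the decorator and names are illustrative rather than any specific vendor's API.

```python
# Illustrative sketch: count GPU-backed operations triggered per request.
import contextvars
from functools import wraps

_gpu_calls = contextvars.ContextVar("gpu_calls", default=0)

def gpu_backed(fn):
    """Decorator that counts every GPU-backed call made while serving a request."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        _gpu_calls.set(_gpu_calls.get() + 1)
        return fn(*args, **kwargs)
    return wrapper

def amplification_for(request_handler, request):
    """Return (response, number of GPU-backed ops triggered by this request)."""
    _gpu_calls.set(0)
    response = request_handler(request)
    return response, _gpu_calls.get()
```

Applying the decorator to every embedding, reranking, and inference call makes the amplification factor a metric that can be tracked per endpoint and alerted on, alongside latency and token counts.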

Design for observability and reproducibility from the start. As pipelines become GPU-bound, tracing which data, model versions, and intermediate steps contributed to a decision becomes harder—but more critical. Systems intended for regulated or operational use should capture lineage information as a first-class concern, not as a post hoc addition.
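
A minimal sketch of what first-class lineage capture might look like, with illustrative field names:

```python
# Illustrative sketch: a lineage record persisted alongside every response.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Lineage:
    request_id: str
    model_versions: dict            # e.g. {"reranker": "v3.2", "llm": "slm-1.1"}
    source_documents: list          # document IDs that influenced the answer
    embedding_snapshot: str         # which representation version was queried
    intermediate_steps: list = field(default_factory=list)
    produced_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```

Storing a record like this with each response makes reproducibility and audit an architectural guarantee rather than a post hoc reconstruction.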

These practices don’t eliminate GPU constraints. They acknowledge them—and design around them—so that AI systems remain predictable, auditable, and economically viable as they scale.

Why This Shift Matters

Enterprise AI is entering a phase where infrastructure constraints matter as much as model capability. GPU availability, cost, and scheduling are no longer operational details—they’re shaping what kinds of AI systems can be deployed reliably at scale.

This shift is already influencing architectural decisions across large organizations. Teams are rethinking how much context they can afford to keep fresh, how deep their reasoning pipelines can go, and whether large models are appropriate for every task. In many cases, smaller, task-specific models and more selective use of retrieval are emerging as practical responses to GPU pressure.

The implications extend beyond cost optimization. GPU-bound systems struggle to guarantee consistent latency, reproducible behavior, and auditable decision paths—all of which are critical in regulated environments. In consequence, AI governance is increasingly constrained by infrastructure realities rather than policy intent alone.

Organizations that fail to account for these limits risk building systems that are expensive, inconsistent, and difficult to trust. Those that succeed will be the ones that design explicitly around GPU constraints, treating them as first-class architectural inputs rather than invisible accelerators.

The next phase of enterprise AI won’t be defined solely by larger models or more data. It will be defined by how effectively teams design systems within the physical and economic limits imposed by GPUs—which have become both the engine and the bottleneck of modern AI.

Author’s note: This article reflects the author’s personal views, based on independent technical research, and does not describe the architecture of any specific organization.


Join us at the upcoming Infrastructure & Ops Superstream on January 20 for expert insights on how to manage GPU workloads—and tips on how to address other orchestration challenges presented by modern AI and machine learning infrastructure. In this half-day event, you’ll learn how to secure GPU capacity, reduce costs, and eliminate vendor lock-in while maintaining ML engineer productivity. Save your seat now to get actionable strategies for building AI-ready infrastructure that meets unprecedented demands for scale, performance, and resilience at the enterprise level.

O’Reilly members can register here. Not a member? Sign up for a 10-day free trial before the event to attend—and explore all the other resources on O’Reilly.

Signals for 2026

9 January 2026 at 07:14

We’re three years into a post-ChatGPT world, and AI remains the focal point of the tech industry. In 2025, several ongoing trends intensified: AI investment accelerated; enterprises integrated agents and workflow automation at a faster pace; and the toolscape for professionals seeking a career edge is now overwhelmingly expansive. But the jury’s still out on the ROI from the vast sums that have saturated the industry. 

We anticipate that 2026 will be a year of increased accountability. Expect enterprises to shift focus from experimentation to measurable business outcomes and sustainable AI costs. There are promising productivity and efficiency gains to be had in software engineering and development, operations, security, and product design, but significant challenges also persist.  

Bigger picture, the industry is still grappling with what AI is and where we’re headed. Is AI a worker that will take all our jobs? Is AGI imminent? Is the bubble about to burst? Economic uncertainty, layoffs, and shifting AI hiring expectations have undeniably created stark career anxiety throughout the industry. But as Tim O’Reilly pointedly argues, “AI is not taking jobs: The decisions of people deploying it are.” No one has quite figured out how to make money yet, but the organizations that succeed will do so by creating solutions that “genuinely improve...customers’ lives.” That won’t happen by shoehorning AI into existing workflows but by first determining where AI can actually improve upon them, then taking an “AI first” approach to developing products around these insights.

As Tim O’Reilly and Mike Loukides recently explained, “At O’Reilly, we don’t believe in predicting the future. But we do believe you can see signs of the future in the present.” We’re watching a number of “possible futures taking shape.” AI will undoubtedly be integrated more deeply into industries, products, and the wider workforce in 2026 as use cases continue to be discovered and shared. Topics we’re keeping tabs on include context engineering for building more reliable, performant AI systems; LLM posttraining techniques, in particular fine-tuning as a means to build more specialized, domain-specific models; the growth of agents, as well as the protocols, like MCP, to support them; and computer vision and multimodal AI more generally to enable the development of physical/embodied AI and the creation of world models. 

Here are some of the other trends that are pointing the way forward.

Software Development

In 2025, AI was embedded in software developers’ everyday work, transforming their roles—in some cases dramatically. A multitude of AI tools are now available to create code, and workflows are undergoing a transformation shaped by new concepts including vibe coding, agentic development, context engineering, eval- and spec-driven development, and more.

In 2026, we’ll see an increased focus on agents and the protocols, like MCP, that support them; new coding workflows; and the impact of AI on assisting with legacy code. But even as software development practices evolve, fundamental skills such as code review, design patterns, debugging, testing, and documentation are as vital as ever.

And despite major disruption from GenAI, programming languages aren’t going anywhere. Type-safe languages like TypeScript, Java, and C# provide compile-time validation that catches AI errors before production, helping mitigate the risks of AI-generated code. Memory safety mandates will drive interest in Rust and Zig for systems programming: Major players such as Google, Microsoft, Amazon, and Meta have adopted Rust for critical systems, and Zig is behind Anthropic’s most recent acquisition, Bun. And Python is central to creating powerful AI and machine learning frameworks, driving complex intelligent automation that extends far beyond simple scripting. It’s also ideal for edge computing and robotics, two areas where AI is likely to make inroads in the coming year.

Takeaways

Which AI tools programmers use matters less than how they use them. With a wide choice of tools now available in the IDE and on the command line, and new options being introduced all the time, it’s useful to focus on the skills needed to produce good code rather than on the tool itself. After all, whatever tool they use, developers are ultimately responsible for the code it produces.

Effectively communicating with AI models is the key to doing good work. The more background AI tools are given about a project, the better the code they generate will be. Developers have to understand both how to manage what the AI knows about their project (context engineering) and how to communicate it (prompt engineering) to get useful outputs.

AI isn’t just a pair programmer; it’s an entire team of developers. Software engineers have moved beyond single coding assistants. They’re building and deploying custom agents, often within complex setups involving multi-agent scenarios, teams of coding agents, and agent swarms. But as the engineering workflow shifts from conducting AI to orchestrating AI, the fundamentals of building and maintaining good software—code review, design patterns, debugging, testing, and documentation—stay the same and will be what elevates purposeful AI-assisted code above the crowd.

Software Architecture

AI has progressed from being something architects might have to consider to something that is now essential to their work. They can use LLMs to accelerate or optimize architecture tasks; they can add AI to existing software systems or use it to modernize those systems; and they can design AI-native architectures, an approach that requires new considerations and patterns for system design. And even if they aren’t working with AI (yet), architects still need to understand how AI relates to other parts of their system and be able to communicate their decisions to stakeholders at all levels.

Takeaways

AI-enhanced and AI-native architectures bring new considerations and patterns for system design. Event-driven models can enable AI agents to act on incoming triggers rather than fixed prompts. In 2026, evolving architectures will become more important as architects look for ways to modernize existing systems for AI. And the rise of agentic AI means architects need to stay up-to-date on emerging protocols like MCP.

Many of the concerns from 2025 will carry over into the new year. Considerations such as incorporating LLMs and RAG into existing architectures, emerging architecture patterns and antipatterns specifically for AI systems, and the focus on API and data integrations elevated by MCP are critical.

The fundamentals still matter. Tools and frameworks are making it possible to automate more tasks. However, to successfully leverage these capabilities to design sustainable architecture, enterprise architects must have a full command of the principles behind them: when to add an agent or a microservice, how to consider cost, how to define boundaries, and how to act on the knowledge they already have.

Infrastructure and Operations

The InfraOps space is undergoing its most significant transformation since cloud computing, as AI evolves from a workload to be managed to an active participant in managing infrastructure itself. With infrastructure sprawling across multicloud environments, edge deployments, and specialized AI accelerators, manual management is becoming nearly impossible. In 2026, the industry will keep moving toward self-healing systems and predictive observability—infrastructure that continuously optimizes itself, shifting the human role from manual maintenance to system oversight, architecture, and long-term strategy.

Platform engineering makes this transformation operational, abstracting infrastructure complexity behind self-service interfaces, which lets developers deploy AI workloads, implement observability, and maintain security without deep infrastructure expertise. The best platforms will evolve into orchestration layers for autonomous systems. While fully autonomous systems remain on the horizon, the trajectory is clear.

Takeaways

AI is becoming a primary driver of infrastructure architecture. AI-native workloads demand GPU orchestration at scale, specialized networking protocols optimized for model training and inference, and frameworks like Ray on Kubernetes that can distribute compute intelligently. Organizations are redesigning infrastructure stacks to accommodate these demands and are increasingly considering hybrid environments and alternatives to hyperscalers to power their AI workloads—“neocloud” platforms like CoreWeave, Lambda, and Vultr.

AI is augmenting the work of operations teams with real-time intelligence. Organizations are turning to AIOps platforms to predict failures before they cascade, identify anomalies humans would miss, and surface optimization opportunities in telemetry data. These systems aim to amplify human judgment, giving operators superhuman pattern recognition across complex environments.

AI is evolving into an autonomous operator that makes its own infrastructure decisions. Companies will implement emerging “agentic SRE” practices: systems that reason about infrastructure problems, form hypotheses about root causes, and take independent corrective action, replicating the cognitive workload that SREs perform, not just following predetermined scripts.

Data

The big story of the back half of 2025 was agents. While the groundwork has been laid, in 2026 we expect focus on the development of agentic systems to persist—and this will necessitate new tools and techniques, particularly on the data side. AI and data platforms continue to converge, with vendors like Snowflake, Databricks, and Salesforce releasing products to help customers build and deploy agents. 

Beyond agents, AI is making its influence felt across the entire data stack, as data professionals target their workflows to support enterprise AI. Significant trends include real-time analytics, enhanced data privacy and security, and the increasing use of low-code/no-code tools to democratize data access. Sustainability also remains a concern, and data professionals need to consider ESG compliance, carbon-aware tooling, and resource-optimized architectures when designing for AI workloads.

Takeaways

Data infrastructure continues to consolidate. The consolidation trend has not only affected the modern data stack but also more traditional areas like the database space. In response, organizations are being more intentional about what kind of databases they deploy. At the same time, modern data stacks have fragmented across cloud platforms and open ecosystems, so engineers must increasingly design for interoperability. 

A multiple database approach is more important than ever. Vector databases like Pinecone, Milvus, Qdrant, and Weaviate help power agentic AI—while they’re a new technology, companies are beginning to adopt vector databases more widely. DuckDB’s popularity is growing for running analytical queries. And even though it’s been around for a while, ClickHouse, an open source distributed OLAP database used for real-time analytics, has finally broken through with data professionals.

The infrastructure to support autonomous agents is coming together. GitOps, observability, identity management, and zero-trust orchestration will all play key roles. And we’re following a number of new initiatives that facilitate agentic development, including AgentDB, a database designed specifically to work effectively with AI agents; Databricks’ recently announced Lakebase, a Postgres database/OLTP engine integrated within the data lakehouse; and Tiger Data’s Agentic Postgres, a database “designed from the ground up” to support agents.

Security

AI is a threat multiplier—59% of tech professionals cited AI-driven cyberthreats as their biggest concern in a recent survey. In response, the cybersecurity analyst role is shifting from low-level human-in-the-loop tasks to complex threat hunting, AI governance, advanced data analysis and coding, and human-AI teaming oversight. But addressing AI-generated threats will also require a fundamental transformation in defensive strategy and skill acquisition—and the sooner it happens, the better.

Takeaways

Security professionals now have to defend a broader attack surface. The proliferation of AI agents expands the attack surface. Security tools must evolve to protect it. Implementing zero trust for machine identities is a smart opening move to mitigate sprawl and nonhuman traffic. Security professionals must also harden their AI systems against common threats such as prompt injection and model manipulation.

Organizations are struggling with governance and compliance. Striking a balance between data utility and vulnerability requires adherence to data governance best practices (e.g., least privilege). Government agencies, industry and professional groups, and technology companies are developing a range of AI governance frameworks to help guide organizations, but it’s up to companies to translate these technical governance frameworks into board-level risk decisions and actionable policy controls.

The security operations center (SOC) is evolving. The velocity and scale of AI-driven attacks can overwhelm traditional SIEM/SOAR solutions. Expect increased adoption of agentic SOC—a system of specialized, coordinated AI agents for triage and response. This shifts the focus of the SOC analyst from reactive alert triage to proactive threat hunting, complex analysis, and AI system oversight.

Product Management and Design

Business focus in 2025 shifted from scattered AI experiments to the challenge of building defensible, AI-native businesses. Next year we’re likely to see product teams moving from proof of concept to proof of value.

One thing to look for: Design and product responsibilities may consolidate under a “product builder”—a full stack generalist in product, design, and engineering who can rapidly build, validate, and launch new products. Companies are currently hiring for this role, although few people actually possess the full skill set at the moment. But regardless of whether product builders become ascendant, product folks in 2026 and beyond will need the ability to combine product validation, good-enough engineering, and rapid design, all enabled by AI as a core accelerator. We’re already seeing the “product manager” role becoming more technical as AI spreads throughout the product development process. Nearly all PMs use AI, but they’ll increasingly employ purpose-built AI workflows for research, user-testing, data analysis, and prototyping.

Takeaways

Companies need to bridge the AI product strategy gap. Most companies have moved past simple AI experiments but are now facing a strategic crisis. Their existing product playbooks (how to size markets, roadmapping, UX) weren’t designed for AI-native products. Organizations must develop clear frameworks for building a portfolio of differentiated AI products, managing new risks, and creating sustainable value. 

AI product evaluation is now mission-critical. As AI becomes a core product component and strategy matures, rigorous evaluation is the key to turning products that are good on paper into those that are great in production. Teams should start by defining what “good” means for their specific context, then build reliable evals for models, agents, and conversational UIs to ensure they’re hitting that target.

Design’s new frontier is conversations and interactions. Generative AI has pushed user experience beyond static screens into probabilistic new multimodal territory. This means a harder shift toward designing nonlinear, conversational systems, including AI agents. In 2026, we’re likely to see increased demand for AI conversational designers and AI interaction designers to devise conversation flows for chatbots and even design a model’s behavior and personality.

What It All Means

While big questions about AI remain unanswered, the best way to plan for uncertainty is to consider the real value you can create for your users and for your teams themselves right now. The tools will improve, as they always do, and the strategies to use them will grow more complex. Being deeply versed in the core knowledge of your area of expertise gives you the foundation you’ll need to take advantage of these quickly evolving technologies—and ensure that whatever you create will be built on bedrock, not shaky ground.

The End of the Sync Script: Infrastructure as Intent

There’s an open secret in the world of DevOps: Nobody trusts the CMDB. The Configuration Management Database (CMDB) is supposed to be the “source of truth”—the central map of every server, service, and application in your enterprise. In theory, it’s the foundation for security audits, cost analysis, and incident response. In practice, it’s a work of fiction. The moment you populate a CMDB, it begins to rot. Engineers deploy a new microservice but forget to register it. An autoscaling group spins up 20 new nodes, but the database only records the original three...

We call this configuration drift, and for decades, our industry’s solution has been to throw more scripts at the problem. We write massive, brittle ETL (Extract-Transform-Load) pipelines that attempt to scrape the world and shove it into a relational database. It never works. The “world”—especially the modern cloud native world—moves too fast.

We realized we couldn’t solve this problem by writing better scripts. We had to change the fundamental architecture of how we sync data. We stopped trying to boil the ocean and fix the entire enterprise at once. Instead, we focused on one notoriously difficult environment: Kubernetes. If we could build an autonomous agent capable of reasoning about the complex, ephemeral state of a Kubernetes cluster, we could prove a pattern that works everywhere else. This article explores how we used the newly open-sourced Codex CLI and the Model Context Protocol (MCP) to build that agent. In the process, we moved from passive code generation to active infrastructure operation, transforming the “stale CMDB” problem from a data entry task into a logic puzzle.

The Shift: From Code Generation to Infrastructure Operation with Codex CLI and MCP

The reason most CMDB initiatives fail is ambition. They try to track every switch port, virtual machine, and SaaS license simultaneously. The result is a data swamp—too much noise, not enough signal. We took a different approach. We drew a small circle around a specific domain: Kubernetes workloads. Kubernetes is the perfect testing ground for AI agents because it’s high-velocity and declarative. Things change constantly. Pods die; deployments roll over; services change selectors. A static script struggles to distinguish between a CrashLoopBackOff (a temporary error state) and a purposeful scale-down. We hypothesized that a large language model (LLM), acting as an operator, could understand this nuance. It wouldn’t just copy data; it would interpret it.

The Codex CLI turned this hypothesis into a tangible architecture by enabling a shift from “code generation” to “infrastructure operation.” Instead of treating the LLM as a junior programmer that writes scripts for humans to review and run, Codex empowers the model to execute code itself. We provide it with tools—executable functions that act as its hands and eyes—via the Model Context Protocol. MCP defines a clear interface between the AI model and the outside world, allowing us to expose high-level capabilities like cmdb_stage_transaction without teaching the model the complex internal API of our CMDB. The model learns to use the tool, not the underlying API.

The architecture of agency

Our system, which we call k8s-agent, consists of three distinct layers. This isn’t a single script running top to bottom; it’s a cognitive architecture.

The cognitive layer (Codex + contextual instructions): This is the Codex CLI running a specific system prompt. We don’t fine-tune the model weights. Infrastructure moves too fast for fine-tuning: A model trained on Kubernetes v1.25 would be hallucinating by v1.30. Instead, we use context engineering—the art of designing the environment in which the AI operates. This involves tool design (creating atomic, deterministic functions), prompt architecture (structuring the system prompt), and information architecture (deciding what information to hide or expose). We feed the model a persistent context file (AGENTS.md) that defines its persona: “You are a meticulous infrastructure auditor. Your goal is to ensure the CMDB accurately reflects the state of the Kubernetes cluster. You must prioritize safety: Do not delete records unless you have positive confirmation that they are orphans.”

The tool layer: Using MCP, we expose deterministic Python functions to the agent.

  • Sensors: k8s_list_workloads, cmdb_query_service, k8s_get_deployment_spec
  • Actuators: cmdb_stage_create, cmdb_stage_update, cmdb_stage_delete

Note that we track workloads (Deployments, StatefulSets), not Pods. Pods are ephemeral; tracking them in a CMDB is an antipattern that creates noise. The agent understands this distinction—a semantic rule that is hard to enforce in a rigid script.
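
As a rough illustration of how this tool layer can be exposed, the sketch below assumes the MCP Python SDK’s FastMCP helper; the tool bodies are simplified stand-ins rather than the production implementation.

```python
# Illustrative sketch: exposing a sensor and an actuator over MCP.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("k8s-agent-tools")
_STAGED: list[dict] = []   # stand-in for the real staging store

@mcp.tool()
def k8s_list_workloads(namespace: str) -> list[dict]:
    """Sensor: return Deployments/StatefulSets (name, kind, replicas) in a namespace."""
    # Static data for illustration; a real sensor would call the Kubernetes API.
    return [{"name": "payment-processor-v2", "kind": "Deployment", "replicas": 3}]

@mcp.tool()
def cmdb_stage_delete(ci_id: str, reason: str) -> dict:
    """Actuator: stage (not commit) a CMDB delete, with the agent's justification."""
    change = {"op": "delete", "ci": ci_id, "reason": reason}
    _STAGED.append(change)
    return change

if __name__ == "__main__":
    mcp.run()   # exposes the tools to the agent over MCP
```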

The state layer (the safety net): LLMs are probabilistic; infrastructure must be deterministic. We bridge this gap with a staging pattern. The agent never writes directly to the production database. It writes to a staged diff. This allows a human (or a policy engine) to review the proposed changes before they are committed.
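
A minimal sketch of the staging pattern, with illustrative types (the CMDB client is a hypothetical stand-in):

```python
# Illustrative sketch: the agent proposes a diff; a reviewer commits it.
from dataclasses import dataclass, field

@dataclass
class StagedChange:
    op: str                 # "create", "update", or "delete"
    ci_id: str
    payload: dict
    agent_reasoning: str    # why the agent proposed this change
    approved: bool = False

@dataclass
class StagingArea:
    changes: list = field(default_factory=list)

    def propose(self, change: StagedChange):
        self.changes.append(change)     # the agent writes here, never to production

    def commit_approved(self, cmdb_client):
        """Apply only the changes a human or policy engine has approved."""
        for c in self.changes:
            if c.approved:
                cmdb_client.apply(c.op, c.ci_id, c.payload)
        self.changes = [c for c in self.changes if not c.approved]
```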

The OODA Loop in Action

How does this differ from a standard sync script? A script follows a linear path: Connect → Fetch → Write. If any step fails or returns unexpected data, the script crashes or corrupts data. Our agent follows the Observe-Orient-Decide-Act (OODA) loop, popularized by military strategists. Unlike a linear script that executes blindly, the OODA loop forces the agent to pause and synthesize information before taking action. This cycle allows it to handle incomplete data, verify assumptions, and adapt to changing conditions—traits essential for operating in a distributed system.

Let’s walk through a real scenario we encountered during our pilot, the Ghost Deployment, to explore the benefits of using an OODA loop. A developer had deleted a deployment named payment-processor-v1 from the cluster but forgot to remove the record from the CMDB. A standard script might pull the list of deployments, see payment-processor-v1 is missing, and immediately issue a DELETE to the database. The risk is obvious: What if the API server was just timing out? What if the script had a bug in its pagination logic? The script blindly destroys data based on the absence of evidence. 

The agent approach is fundamentally different. First, it observes: It calls k8s_list_workloads and cmdb_query_service and notices the discrepancy. Second, it orients: It checks its context instructions to “verify orphans before deletion” and decides to call k8s_get_event_history. Third, it decides: Seeing a “delete” event in the logs, it concludes that the record is a genuine orphan rather than a transient API failure. Finally, it acts: It calls cmdb_stage_delete with a comment confirming the deletion. The agent didn’t just sync data; it investigated. It handled the ambiguity that usually breaks automation.
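
Sketched as code, the loop for this scenario might look something like the following, with the MCP tools passed in as plain callables; the event-checking rule is an illustrative assumption.

```python
# Illustrative sketch: the OODA loop for the "ghost deployment" case.
def reconcile_missing_workload(name, k8s_list_workloads, cmdb_query_service,
                               k8s_get_event_history, cmdb_stage_delete):
    # Observe: compare the cluster with the CMDB.
    in_cluster = any(w["name"] == name for w in k8s_list_workloads())
    in_cmdb = cmdb_query_service(name) is not None
    if in_cluster or not in_cmdb:
        return "in sync"

    # Orient: the context instructions require verifying orphans before deletion.
    events = k8s_get_event_history(name)

    # Decide: absence of evidence is not enough; require a recorded delete event.
    if any(e["reason"] == "Deleted" for e in events):
        # Act: stage (not commit) the deletion, with the evidence attached.
        cmdb_stage_delete(name, reason=f"Deployment {name} deleted per event history")
        return "staged delete"
    return "kept record: could not confirm orphan"
```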

Solving the “Semantic Gap”

This specific Kubernetes use case highlights a broader problem in IT operations: the “semantic gap.” The data in our infrastructure (JSON, YAML, logs) is full of implicit meaning. A label “env: production” changes the criticality of a resource. A status CrashLoopBackOff means “broken,” but Completed means “finished successfully.” Traditional scripts require us to hardcode every permutation of this logic, resulting in thousands of lines of unmaintainable if/else statements. With the Codex CLI, we replace those thousands of lines of code with a few sentences of English in the system prompt: “Ignore jobs that have completed successfully. Sync failing Jobs so we can track instability.” The LLM bridges the semantic gap. It understands what “instability” implies in the context of a job status. We’re describing our intent, and the agent is handling the implementation.

Scaling Beyond Kubernetes

We started with Kubernetes because it’s the “hard mode” of configuration management. In a production environment with thousands of workloads, things change constantly. A standard script sees a snapshot and often gets it wrong. An agent, however, can work through the complexity. It might run its OODA loop multiple times to solve a single issue—by checking logs, verifying dependencies, and confirming rules before it ever makes a change. This ability to connect reasoning steps allows it to handle the scale and uncertainty that breaks traditional automation.

But the pattern we established, agentic OODA Loops via MCP, is universal. Once we proved the model worked for Pods and Services, we realized we could extend it. For legacy infrastructure, we can give the agent tools to SSH into Linux VMs. For SaaS management, we can give it access to Salesforce or GitHub APIs. For cloud governance, we can ask it to audit AWS Security Groups. The beauty of this architecture is that the “brain” (the Codex CLI) stays the same. To support a new environment, we don’t need to rewrite the engine; we just hand it a new set of tools. However, shifting to an agentic model forces us to confront new trade-offs. The most immediate is cost versus context. We learned the hard way that you shouldn’t give the AI the raw YAML of a Kubernetes deployment—it consumes too many tokens and distracts the model with irrelevant details. Instead, you create a tool that returns a digest—a simplified JSON object with only the fields that matter. This is context optimization, and it is the key to running agents cost-effectively.

Conclusion: The Human in the Cockpit

There’s a fear that AI will replace the DevOps engineer. Our experience with the Codex CLI suggests the opposite. This technology does not remove the human; it elevates them. It promotes the engineer from a “script writer” to a “mission commander.” The stale CMDB was never really a data problem; it was a labor problem. It was simply too much work for humans to manually track and too complex for simple scripts to automate. By introducing an agent that can reason, we finally have a mechanism capable of keeping up with the cloud. 

We started with a small Kubernetes cluster. But the destination is an infrastructure that is self-documenting, self-healing, and fundamentally intelligible. The era of the brittle sync script is over. The era of infrastructure as intent has begun!

AI and the Next Economy

7 January 2026 at 07:20

The narrative from the AI labs is dazzling: build AGI, unlock astonishing productivity, and watch GDP surge. It’s a compelling story, especially if you’re the one building or investing in the new thought machines. But it skips the part that makes an economy an economy: circulation.

An economy is not simply production. It is production matched to demand, and demand requires broadly distributed purchasing power. When we forget that, we rediscover an old truth the hard way: You can’t build a prosperous society that leaves most people on the sidelines.

In The Marriage of Heaven and Hell, the visionary poet and painter William Blake (writing during the first Industrial Revolution) put the circulatory logic perfectly: “The Prolific would cease to be prolific unless the Devourer as a sea received the excess of his delights.” In other words: Output has to be consumed. The system has to flow.

[Image: The Marriage of Heaven and Hell, created with Gemini and Nano Banana Pro]

Today, many AGI narratives assume that the “prolific” can keep producing and the broad mass of customers (“the devourer”) somehow continue to buy, even as more and more human labor is displaced and labor income and bargaining power collapses. That’s not a future of abundance. It’s a recipe for a kind of congestive heart failure for the economy: Profits and capabilities accumulate in what should be the circulatory pump, while the rest of the body is starved.

So if we want an AI economy that makes society richer, we need to ask not just “How smart will the models get?” and “How rich will AI developers, their investors, and their immediate customers get?” but “How will the value circulate in the real economy of goods and services?” Not “What can we automate?” but “What new infrastructure and institutions are needed to turn capability into widely shared prosperity?”

Two versions of the future are often discussed as if they are separate. They’re not.

The Discovery Economy: Capability Is Not GDP

I’m excited by the discovery potential of AI. It may help us solve problems that have defied us for decades: energy abundance, new materials, cures for diseases. As Nick Hanauer and Eric Beinhocker put it so well, “Prosperity is the accumulation of solutions to human problems.” That AI can grow the store of solutions to human problems is a wonderful dream, and it should be our goal to make it come true.

But discovery alone is not the same thing as economic value, and it certainly isn’t the same thing as widely shared prosperity. Between discovery and economic value lies a long, failure-prone pipeline: productization, validation, regulation, manufacturing, distribution, training, and maintenance. The valley of death is not a metaphor; it is a bureaucratic, technical, and financial landscape where many promising advances go to die. And from that valley of death, the path follows either an ascent to the broad uplands of shared prosperity, or a shortcut to a dead-end peak of wealth concentration.

If AI accelerates discovery but doesn’t accelerate diffusion, we get headlines and paper wealth, but broad-based growth takes much longer to arrive. We get a taller peak, not a wider plateau.

The distribution question begins with choke points. Who owns the discovery engines? Who controls access to compute, data, and the models themselves? Who captures the IP? Who has the channels to bring new capabilities to market? To what extent do incumbents and the moats they have built restrict innovation? Do government regulatory processes also speed up, or do they keep AI adoption at a glacial pace? Do those at the choke points use their market shaping power wisely? If those choke points are tight, the discovery economy becomes a kind of discovery feudalism: The breakthroughs happen, but the spillovers are limited, adoption is slow, and the returns concentrate.

If, on the other hand, the tools and standards of diffusion are broadly available, if interoperability is real, if licensing is designed for many routes to market, if regulatory processes can also be sped up with AI, then the discovery economy can become what we want it to be: a generalized engine of progress. There’s a huge amount of work to be done here.

Many of the questions are economic. If discovery becomes cheap, does the rest of the pipeline get cheaper, or does it get more expensive to compensate for other lost revenue? The happy dream is that a cancer vaccine becomes available at the marginal cost of production. The unhappy reality may be that the drug manufacturers conclude “We have to price this high to make up for our losses from the existing drugs that people no longer need to buy.” Even in an age of cheap discovery, it is possible that some vaccines will still cost millions of dollars per dose and only be available to people who can afford them.

The Labor Replacement Economy: Demand Is the Constraint

The other story is labor replacement. We are told that AI will substitute for a great deal of intellectual work, much as machines replaced animal labor and much of human manual labor. Businesses become more efficient. Margins rise. Output increases. Prices fall and spending power increases for those who are still employed.

But who are the customers when a large number of humans are suddenly no longer gainfully employed?

This is not a rhetorical question. It is the central macroeconomic constraint that much of Silicon Valley prefers not to model. You can’t replace wages with cheap inference and expect the consumer economy to hum along unchanged. If the wage share falls fast enough, the economy may become less stable. Social conflict rises. Politics turns punitive. Investment in long-term complements collapses. And the whole system starts behaving like a fragile rent-extraction machine rather than a durable engine of prosperity.

In a 2012 Harvard Business Review article, Michael Schrage asked a powerful strategic question: “Who do you want your customers to become?” As he put it, the answer to that question is the true foundation of great companies. “Successful companies have a ‘vision of the customer future’ that matters every bit as much as their vision of their products.”

In the early days of mass production, Henry Ford reportedly understood that if you want mass markets, you need mass purchasing power. He paid higher wages and reduced working hours, helping to invent what we now call the weekend, and with it, the leisure economy. The productivity dividend was distributed in ways that created new customers.

Ford’s innovation had consequences beyond the factory gate. Mass adoption of cars required a vast extension of infrastructure: roads, traffic rules, hotels, parking, gas stations, repair shops, and the entire social reorganization of distance. The technology mattered, but the complements made it an economy.

Steven Johnson tells a related story in his book Wonderland. The preindustrial European desire for Indian calico and chintz helped catalyze modern shopping environments and global trade networks. But there’s even more to that story. When it became cheaper to make cloth, fashion, taste, and the democratization of status display became a larger part of the economy. The point is not “consumerism is good.” The point is that economies grow because desires and capabilities change as the result of innovations, infrastructure, and institutions that allow the benefits to spread. New forms of production require new systems of distribution, experience, and exchange.

AI is at that inflection point now. We may be building the engines of extraordinary productivity, but we are not yet building the social machinery that will make that productivity broadly usable and broadly beneficial. We are just hoping that it somehow evolves.

This failure of insight and imagination is the Achilles’ heel of today’s AI giants. They imagine themselves as contestants in a race to be the next dominant platform, with the majority of the benefits going to whoever has the smartest model, the most users, and the most developers. This is not unlike the vision of Marc Andreessen’s Netscape in the early days of the web. Netscape sought to replace Microsoft Windows as the platform for users and developers, using the internet moment to become the next monopoly gatekeeper. Instead, victory went to those who embraced the web’s architecture of participation.

Now, it is true that 30 years later, we are in a world where companies such as Google, Apple, Amazon, and Meta have indeed become gatekeepers, extracting huge economic rents via their control over human attention. But it didn’t start that way. Amazon and Google in particular rose to prominence because they solved the circulation problem. Amazon’s flywheel, in which more users draw in more suppliers with more and cheaper products, which in turn brings in more users, in a virtuous circle, is a great example of an economic circulation strategy. Not only did Amazon drive enormous consumer value, it also created a whole new set of suppliers.

Google’s original search engine strategy was likewise deeply rooted in the circulation of value. As Larry Page put it in 2004, “The portal strategy tries to own all of the information….We want to get you out of Google and to the right place as fast as possible.” The company’s algorithms for both search and ad relevance were a real advance in market coordination and shared value creation. Economists like Hal Varian were brought in to design advertising models that were better not only for Google but for its customers. Google grew along with the web economy it helped to create, not at its expense. Yes, that changed over time, but let’s not forget how important Google’s support for a circulatory economy was to its initial success.

Google also provides a really good example of mechanism design for solving problems with rights holders, one that carries economic lessons for today. When music companies sent takedown notices to YouTube for user-generated content that made unauthorized use of their IP, YouTube asked, “How about we help you monetize it instead?” In the process it created a new market.

The extent to which Amazon and Google seem to have forgotten these lessons is a sign of their decline, not something to be emulated. It provides an opportunity for those (including Google and Amazon, if they recommit to their roots!) who are building the next generation of technology platforms. Build a flywheel, enable a circulatory economy. AI should not be enshittified from the beginning, prioritizing value capture over broadly based value creation.

Decentralized Architectures Create Value; Centralization Captures It

An important lesson from the internet technology revolution of the 1990s and early 2000s is that decentralized architectures are more innovative and more competitive than those that are centralized. Decentralization creates value; centralization captures it. The PC decentralized the computer industry, ending IBM’s chokehold on competition during the mainframe era. The new software industry exploded. Over the next few decades, as it became dominant, Microsoft recentralized the industry by monopolizing operating systems and office applications in the way that IBM had monopolized computer hardware. The personal computer software industry began to stagnate, until open source software and the open protocols of the internet undermined Microsoft’s centralized control over the industry and ushered in a new era of innovation.

The tragedy began again, as those who had once flourished as internet innovators in turn began to prioritize control, raising moats and extracting rents rather than continuing to innovate, leading to today’s internet oligopoly. This, of course, is what allowed the current AI revolution to happen as it did. Google invented the transformer architecture, and then published it freely, but did not itself fully explore the possibilities because it was protecting an existing business model. So it was left to OpenAI to invent the future.

However, the AI revolution has a significant difference from the early internet. The U.S.’s current setup of large, closed models, enormous data centers for model training, and a highly concentrated cloud market has echoes of central planning, in which a small cadre of deep-pocketed investors choose the winners at the outset rather than discovering them through a period of intense market competition and finding product-market fit (which involves finding products and services that users not only want but are willing to pay for at more than the cost of production!).

Market competition is important to ensuring that the economy is not reliant on a handful of firms reinvesting their profits into production. When this becomes the case, circulation can get cut off. Profits stop being reinvested and instead become hoarded, trapped within the sphere of financial circulation, from dividends to share buybacks to more dividends and less and less to investment in fixed or human capital.

If we are to realize the full potential of AI to reinvigorate and reinvent the economy, we need to embrace decentralized architectures. This might involve the triumph of lower-cost open weight models that commoditize and decentralize inference, and it also certainly entails protocols and technical infrastructure that can reduce the inherent concentrating tendencies of economies of scale and other technological moats that make concentration a more efficient mode of production.

Centralization is an advantage in a mature economy; it is a disadvantage when you are trying to invent the future. Premature centralization is a mistake.

A Manifesto for a Circulatory AI Economy

If AI labs wish to be architects of a prosperous future, they must work as hard on inventing the new economy’s circulatory system as they do on improving model capabilities. They need to measure success by diffusion, not just capability. They have to treat the labor transition as a core problem to be solved, not just studied. They have to be willing to win in the marketplace, not through artificial moats. That means committing to open interfaces, portability, and interoperability. General-purpose capabilities should not become a private toll road.

Companies adopting AI face their own challenges. Simply using AI to slash costs and turbocharge profits is a kind of failure. The productivity dividend should show up for employees not as a pink slip but as some combination of higher pay, reduced hours, profit-sharing, and investment in retraining. They must use the opportunity to reinvent themselves by creating new kinds of value that people will be eager to pay for, not just trying to preserve what they have.

Governments and society as a whole need to invest in the complements that will shape the new AI economy. Diffusion will be limited by the fragility of our energy grid, by bottlenecks in the supply of rare earths, but also by sclerotic processes for approving new construction and new innovations.

Governments must also develop scenarios for a future in which taxes on labor might provide a much smaller part of their income. Solutions are not obvious, and transitions will be hard, but if we face a future where capital appreciation is abundant and labor income is scarce, perhaps it’s time to consider reducing taxes on labor and increasing those on capital gains.

Over the next few months, we intend to convene a series of conversations and to publish a series of more detailed action plans in each of these areas. Let me know if you think you have ideas to contribute.

The Choice

We can build an AI economy that concentrates value, hollows out demand, and forces society into a reactive cycle of backlash and repair. Or we can build an AI economy that circulates, where discoveries diffuse, where productivity dividends translate into purchasing power and time, and where the complements are built fast enough that society becomes broadly more capable.

AI labs like to say they are building intelligence. They are making good progress. But if they want to build prosperity, they also need to discover the flywheel for the AI economy.

The prolific needs the devourer. Not as a villain, not as an obstacle, but as the sea that receives the excess, and returns it, transformed, as the next wave of demand, innovation, and shared flourishing.

MCPs for Developers Who Think They Don’t Need MCPs

5 January 2026 at 06:01

The following article originally appeared on Block’s blog and is being republished here with the author’s permission.

Lately, I’ve seen more developers online starting to side eye MCP. There was a tweet by Darren Shepherd that summed it up well:

Most devs were introduced to MCP through coding agents (Cursor, VS Code) and most devs struggle to get value out of MCP in this use case…so they are rejecting MCP because they have a CLI and scripts available to them which are way better for them.

Fair. Most developers were introduced to MCPs through some chat-with-your-code experience, and sometimes it doesn’t feel better than just opening your terminal and using the tools you know. But here’s the thing…

MCPs weren’t built just for developers.

They’re not just for IDE copilots or code buddies. At Block, we use MCPs across everything, from finance to design to legal to engineering. I gave a whole talk on how different teams are using goose, an AI agent. The point is MCP is a protocol. What you build on top of it can serve all kinds of workflows.
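
To make “it’s just a protocol” concrete, here is a minimal sketch of a custom MCP server written with the MCP Python SDK’s FastMCP helper. The server name, the tool, and the finance lookup it pretends to wrap are all hypothetical; the point is that anything with an API, not just developer tooling, can be exposed to an agent this way.

```python
# Hypothetical sketch: an MCP server exposing a non-developer workflow.
# Assumes the official MCP Python SDK is installed; the tool and the finance
# system it pretends to call are made up for illustration.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("finance-helpers")

@mcp.tool()
def get_expense_report_status(report_id: str) -> str:
    """Return the approval status of an expense report."""
    # A real server would call your finance system's API here;
    # a canned answer keeps the sketch self-contained.
    return f"Expense report {report_id} is awaiting manager approval."

if __name__ == "__main__":
    mcp.run()  # stdio transport, so any MCP client (goose, Claude Desktop, ...) can connect
```

Once a client like goose has a server like this registered, a teammate in finance can ask about a report in chat and the agent calls the tool for them, no terminal required.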

But I get it… Let’s talk about the dev-specific ones that are worth your time.

GitHub: More Than Just the CLI

If your first thought is “Why would I use GitHub MCP when I have the CLI?” I hear you. GitHub’s MCP is kind of bloated right now. (They know. They’re working on it.)

But also: You’re thinking too local.

You’re imagining a solo dev setup where you’re in your terminal, using GitHub CLI to do your thing. And honestly, if all you’re doing is opening a PR or checking issues, you probably should use the CLI.

The CLI was never meant to coordinate across tools; it’s built for local, linear commands. But what if your GitHub interactions happened somewhere else entirely?

MCP shines when your work touches multiple systems like GitHub, Slack, and Jira without you stitching it together.

Here’s a real example from our team:

Slack thread. Real developers in real time.

Dev 1: I think there’s a bug with xyz

Dev 2: Let me check… yep, I think you’re right.

Dev 3: @goose is there a bug here?

goose: Yep. It’s in these lines… [code snippet]

Dev 3: Okay @goose, open an issue with the details. What solutions would you suggest?

goose: Here are 3 suggestions: [code snippets with rationale]

Dev 1: I like Option 1

Dev 2: me too

Dev 3: @goose, implement Option 1

goose: Done. Here’s the PR.

All of that happened in Slack. No one opened a browser or a terminal. No one context-switched. Issue tracking, triaging, discussing fixes, implementing code in one thread in a five-minute span.

We’ve also got teams tagging Linear or Jira tickets and having goose fully implement them. One team had goose do 15 engineering days’ worth of work in a single sprint. The team literally ran out of tasks and had to pull from future sprints. Twice!

So yes, GitHub CLI is great. But MCP opens the door to workflows where GitHub isn’t the only place where dev work happens. That’s a shift worth paying attention to.

Context7: Docs That Don’t Suck

Here’s another pain point developers hit: documentation.

You’re working with a new library. Or integrating an API. Or wrestling with an open source tool.

The Context7 MCP pulls up-to-date docs, code examples, and guides right into your AI agent’s brain. You just ask and get answers to questions like:

  • How do I create a payment with the Square SDK?
  • What’s the auth flow for Firebase?
  • Is this library tree-shakable?

It doesn’t rely on stale LLM training data from two years ago. It scrapes the source of truth right now. Giving it updated…say it with me…CONTEXT.

Developer “flow” is real, and every interruption steals precious focus time. This MCP helps you figure out new libraries, troubleshoot integrations, and get unstuck without leaving your IDE.

Repomix: Know the Whole Codebase Without Reading It

Imagine you join a new project or want to contribute to an open source one, but it’s a huge repo with lots of complexity.

Instead of poking around for hours trying to draw an architectural diagram in your head, you just tell your agent: “goose, pack this project up.”

It runs Repomix, which compresses the entire codebase into an AI-optimized file. From there, your convo might go like this:

  • Where’s the auth logic?
  • Show me how API calls work.
  • What uses UserContext?
  • What’s the architecture?
  • What’s still a TODO?

You get direct answers with context, code snippets, summaries, and suggestions. It’s like onboarding with a senior dev who already knows everything. Sure, you could grep around and piece things together. But Repomix gives you the whole picture—structure, metrics, patterns—compressed and queryable.

And it even works with remote public GitHub repos, so you don’t need to clone anything to start exploring.

This is probably my favorite dev MCP. It’s a huge time saver for new projects, code reviews, and refactoring.

Chrome DevTools MCP: Web Testing While You Code

The Chrome DevTools MCP is a must-have for frontend devs. You’re building a new form/widget/page/whatever. Instead of opening your browser, typing stuff in, and clicking around, you just tell your agent: “Test my login form on localhost:3000. Try valid and invalid logins. Let me know what happens.”

Chrome opens, test runs, screenshots captured, network traffic logged, console errors noted. All done by the agent.

This is gold for frontend devs who want to actually test their work before throwing it over the fence.

Could you script all this with CLIs and APIs? Sure, if you want to spend your weekend writing glue code. But why would you want to do that when MCP gives you that power right out of the box… in any MCP client?!

So no, MCPs are not overhyped. They’re how you plug AI into everything you use: Slack, GitHub, Jira, Chrome, docs, codebases—and make that stuff work together in new ways.

Recently, Anthropic called out the real issue: Most dev setups load tools naively, bloat the context, and confuse the model. It’s not the protocol that’s broken. It’s that most people (and agents) haven’t figured out how to use it well yet. Fortunately, goose has—it manages MCPs by default, enabling and disabling as you need them.

But I digress.

Step outside the IDE, and that’s when you really start to see the magic.

PS Happy first birthday, MCP! 🎉

If You’ve Never Broken It, You Don’t Really Know It

17 December 2025 at 09:54

The following article originally appeared on Medium and is being republished here with the author’s permission.

There’s a fake confidence you can carry around when you’re learning a new technology. You watch a few videos, skim some docs, get a toy example working, and tell yourself, “Yeah, I’ve got this.” I’ve done that. It never lasts. A difficult lesson often accompanies the only experience that matters.

You learn through failure—falling flat on your face, looking at the mess, and figuring out why it broke. Anything that feels too easy? It probably was, and you didn’t exit the process with anything worth learning.

Ask About Failure: Failure === Experience

When I’m hiring someone who claims relational database expertise, I ask a “trick” question:

Tell me about the worst database schema you ever created. What did it teach you to avoid?

It’s not really a trick. Anyone who’s been knee‑deep in relational databases knows there’s no perfect schema. There are competing use cases that constantly pull against each other. You design for transaction workloads, but inevitably, someone tries to use it for reporting, then everyone wonders why queries crawl. Another developer on the team inadvertently optimizes the schema (usually years later) for the reporting use case only to make the transactional workload unworkable.

The correct answer usually sounds like:

We built for transactional throughput—one of the founders of the company thought MySQL was a database, which was our first mistake. The business then used it for reporting purposes. The system changed hands several times over the course of several years. Joins became gnarly, indices didn’t match the access patterns, and nightly jobs started interfering with user traffic. We had to split read replicas, eventually introduce a warehouse, and after 5–6 years, we ended up simplifying the transactions and moving them over to Cassandra.

That’s a person who has lived the trade-offs. They’ve experienced a drawn-out existential failure related to running a database. While they might not know how to solve some of the silly logic questions that are increasingly popular in job interviews, this is the sort of experience that carries far more weight with me.

The Schema That Nearly Broke Me

I once shipped a transactional schema that looked fine on paper: normalized, neat, everything in its proper place.

Then analytics showed up with “just a couple of quick dashboards.” Next thing you know, my pretty 3NF model, now connected to every elementary classroom in America, was being used like a million-row Excel spreadsheet to summarize an accounting report. For a few months, it was fine until it wasn’t, and the database had made a slow‑motion faceplant because it was spending 80% of its time updating an index. It wasn’t as if I could fix anything, because that would mean several days of downtime coupled with a rewrite for a project whose contract was almost up.

And how were we trying to fix it? If you’ve been in this situation, you’ll understand that what I’m about to write is the sign that you have reached a new level of desperate failure. Instead of considering a rational approach, like reforming the schema or offloading what had become a “web-scale” workload (this was 2007) to a NoSQL database, we were trying to figure out how to purchase faster hard drives with higher IOPS.

I learned a lot of things:

  • I learned that upgrading hardware (buying a faster machine or dropping a million dollars on hard drives) will only delay your crisis. The real fix is unavoidable—massive horizontal scaling is incompatible with relational databases.
  • I learned the meaning of “query plan from hell.” We band‑aided it with materialized views and read replicas. Then we did what we should’ve done from day one: set up an actual reporting path.
  • If you are having to optimize for a query plan every week? Your database is sending you an important signal, which you should translate to, “It’s time to start looking for an alternative.”

Lesson burned in: Design for the use case you actually have, not the one you hope to have—and assume the use case will change.

What Does This Have to Do with Cursor and Copilot?

I’m seeing a lot of people writing on LinkedIn and other sites about how amazing vibe coding is. These celebratory posts reveal more about the people posting them than they realize, as they rarely acknowledge the reality of the process—it’s not all fun and games. While it is astonishing how much progress one can make in a day or a week, those of us who are actually using these tools to write code are the first to tell you that we’re learning a lot of difficult lessons.

It’s not “easy.” There’s nothing “vibey” about the process, and if you are doing it right, you are starting to use curse words in your prompts. For example, one of my prompts in response to a Cursor Agent yesterday was: “You have got to be kidding me, I have a rule that stated that I never wanted you to do that, and you just ignored it?”

Whenever I see people get excited about the latest, greatest fad thing that’s changing the world, I’m also the first to notice that maybe they aren’t using it at all. If they were, they’d understand that it’s not as “easy” as they are reporting.

The failure muscle you build with databases is the same one you need with AI coding tools. You can’t tiptoe in. You have to push until something breaks. Then you figure out how to approach a new technology as a professional.

  • Ask an agent to refactor one file—great.
  • Ask it to coordinate changes across 20 files, rethink error handling, and keep tests passing—now we’re learning.
  • Watch where it stumbles, and learn to frame the work so it can succeed next time.
  • Spend an entire weekend on a “wild goose chase” because your agentic coder decided to ignore your Cursor rules completely. ← This is expensive, but it’s how you learn.

The trick isn’t avoiding failure. It’s failing in a controlled, reversible way.

The Meta Lesson

If you’ve never broken it, you don’t really know it. This is true for coding, budgeting, managing, cooking, and skiing. If you haven’t failed, you don’t know it. And most of the people talking about “vibe coding” haven’t.

The people I trust most as engineers can tell me why something failed and how they adjusted their approach as a result. That’s the entire game with AI coding tools. The faster you can run the loop—try → break → inspect → refine—the better you get.

AI, MCP, and the Hidden Costs of Data Hoarding

15 December 2025 at 08:15

The Model Context Protocol (MCP) is genuinely useful. It gives people who develop AI tools a standardized way to call functions and access data from external systems. Instead of building custom integrations for each data source, you can expose databases, APIs, and internal tools through a common protocol that any AI can understand.

However, I’ve been watching teams adopt MCP over the past year, and I’m seeing a disturbing pattern. Developers are using MCP to quickly connect their AI assistants to every data source they can find—customer databases, support tickets, internal APIs, document stores—and dumping it all into the AI’s context. And because the AI is smart enough to sort through a massive blob of data and pick out the parts that are relevant, it all just works! Which, counterintuitively, is actually a problem. The AI cheerfully processes massive amounts of data and produces reasonable answers, so nobody even thinks to question the approach.

This is data hoarding. And like physical hoarders who can’t throw anything away until their homes become so cluttered they’re unliveable, data hoarding has the potential to cause serious problems for our teams. Developers learn they can fetch far more data than the AI needs and provide it with little planning or structure, and the AI is smart enough to deal with it and still give good results.

When connecting a new data source takes hours instead of days, many developers don’t take the time to ask what data actually belongs in the context. That’s how you end up with systems that are expensive to run and impossible to debug, while an entire cohort of developers misses the chance to learn the critical data architecture skills they need to build robust and maintainable applications.

How Teams Learn to Hoard

Anthropic released MCP in late 2024 to give developers a universal way to connect AI assistants to their data. Instead of maintaining separate code for connectors to let AI access data from, say, S3, OneDrive, Jira, ServiceNow, and your internal DBs and APIs, you use the same simple protocol to provide the AI with all sorts of data to include in its context. It quickly gained traction. Companies like Block and Apollo adopted it, and teams everywhere started using it. The promise is real; in many cases, the work of connecting data sources to AI agents that used to take weeks can now take minutes. But that speed can come at a cost.

Let’s start with an example: a small team working on an AI tool that reads customer support tickets, categorizes them by urgency, suggests responses, and routes them to the right department. They needed to get something working quickly but faced a challenge: They had customer data spread across multiple systems. After spending a morning arguing about what data to pull, which fields were necessary, and how to structure the integration, one developer decided to just build it, creating a single getCustomerData(customerId) MCP tool that pulls everything they’d discussed—40 fields from three different systems—into one big response object. To the team’s relief, it worked! The AI happily consumed all 40 fields and started answering questions, and no more discussions or decisions were needed. The AI handled all the new data just fine, and everyone felt like the project was on the right track.

Day two, someone added order history so the assistant could explain refunds. Soon the tool pulled Zendesk status, CRM status, eligibility flags that contradicted each other, three different name fields, four timestamps for “last seen,” plus entire conversation threads, and combined them all into an ever-growing data object.

The assistant kept producing reasonable-looking answers, even as the data it ingested kept growing in scale. However, the model now had to wade through thousands of irrelevant tokens before answering simple questions like “Is this customer eligible for a refund?” The team ended up with a data architecture that buried the signal in noise, putting stress on the AI to dig that signal back out and setting up serious long-term problems. But they didn’t realize it yet, because the answers still looked reasonable. As they added more data sources over the following weeks, the AI started taking longer to respond. Hallucinations crept in that they couldn’t trace back to any specific data source. What had been a really valuable tool became a bear to maintain.

The team had fallen into the data hoarding trap: Their early quick wins created a culture where people just threw whatever they needed into the context, and eventually it grew into a maintenance nightmare that only got worse as they added more data sources.

The Skills That Never Develop

There are as many opinions on data architecture as there are developers, and there are usually many ways to solve any one problem. One thing that almost everyone agrees on is that it takes careful choices and lots of experience. But it’s also the subject of lots of debate, especially within teams, precisely because there are so many ways to design how your application stores, transmits, encodes, and uses data.

Most of us fall into just-in-case thinking at one time or another, especially early in our careers—pulling all the data we might possibly need just in case we need it rather than fetching only what we need when we actually need it (which is an example of the opposite, just-in-time thinking). Normally when we’re designing our data architecture, we’re dealing with immediate constraints: ease of access, size, indexing, performance, network latency, and memory usage. But when we use MCP to provide data to an AI, we can often sidestep many of those trade-offs…temporarily.

The more we work with data, the better we get at designing how our apps use it. The more early-career developers are exposed to it, the more they learn through experience why, for example, System A should own customer status while System B owns payment history. Healthy debate is an important part of this learning process. Through all of these experiences, we develop an intuition for what “too much data” looks like—and how to handle all of those tricky but critical trade-offs that create friction throughout our projects.

MCP can remove the friction that comes from those trade-offs by letting us avoid having to make those decisions at all. If a developer can wire up everything in just a few minutes, there’s no need for discussion or debate about what’s actually needed. The AI seems to handle whatever data you throw at it, so the code ships without anyone questioning the design.

Without all of that experience making, discussing, and debating data design choices, developers miss the chance to build critical mental models about data ownership, system boundaries, and the cost of moving unnecessary data around. They spend their formative years connecting instead of architecting. This is another example of what I call the cognitive shortcut paradox—AI tools that make development easier can prevent developers from building the very skills they need to use those tools effectively. Developers who rely solely on MCP to handle messy data never learn to recognize when data architecture is problematic, just like developers who rely solely on tools like Copilot or Claude Code to generate code never learn to debug what it creates.

The Hidden Costs of Data Hoarding

Teams use MCP because it works. Many teams carefully plan their MCP data architecture, and even teams that do fall into the data hoarding trap still ship successful products. But MCP is still relatively new, and the hidden costs of data hoarding take time to surface.

Teams often don’t discover the problems with a data hoarding approach until they need to scale their applications. That bloated context that barely registered as a cost for your first hundred queries starts showing up as a real line item in your cloud bill when you’re handling millions of requests. Every unnecessary field you’re passing to the AI adds up, and you’re paying for all that redundant data on every single AI call.
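
Some rough arithmetic makes the scale of that line item concrete. All of the numbers below, including the per-token price, are illustrative assumptions rather than any particular provider’s pricing:

```python
# Illustrative arithmetic only: token counts and price are assumptions.
tokens_fetched_per_call = 5_000        # what a hoarded tool stuffs into context
tokens_actually_needed = 200           # what a focused tool would have sent
calls_per_month = 1_000_000
price_per_million_input_tokens = 3.00  # hypothetical USD rate

wasted_tokens = (tokens_fetched_per_call - tokens_actually_needed) * calls_per_month
wasted_dollars = wasted_tokens / 1_000_000 * price_per_million_input_tokens
print(f"~{wasted_tokens:,} wasted input tokens/month ≈ ${wasted_dollars:,.0f}")
# -> ~4,800,000,000 wasted input tokens/month ≈ $14,400
```

At small volumes nobody notices a few thousand extra tokens; at millions of calls, the same habit quietly becomes a five-figure monthly bill.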

Any developer who’s dealt with tightly coupled classes knows that when something goes wrong—and it always does, eventually—it’s a lot harder to debug. You often end up dealing with shotgun surgery, that really unpleasant situation where fixing one small problem requires changes that cascade across multiple parts of your codebase. Hoarded data creates the same kind of technical debt in your AI systems: When the AI gives a wrong answer, tracking down which field it used or why it trusted one system over another is difficult, often impossible.

There’s also a security dimension to data hoarding that teams often miss. Every piece of data you expose through an MCP tool is a potential vulnerability. If an attacker finds an unprotected endpoint, they can pull everything that tool provides. If you’re hoarding data, that’s your entire customer database instead of just the three fields actually needed for the task. Teams that fall into the data hoarding trap find themselves violating the principle of least privilege: Applications should have access to the data they need, but no more. That can bring an enormous security risk to their whole organization.

In an extreme case of data hoarding infecting an entire company, you might discover that every team in your organization is building their own blob. Support has one version of customer data, sales has another, product has a third. The same customer looks completely different depending on which AI assistant you ask. New teams come along, see what appears to be working, and copy the pattern. Now you’ve got data hoarding as organizational culture.

Each team thought they were being pragmatic, shipping fast, and avoiding unnecessary arguments about data architecture. But the hoarding pattern spreads through an organization the same way technical debt spreads through a codebase. It starts small and manageable. Before you know it, it’s everywhere.

Practical Tools for Avoiding the Data Hoarding Trap

It can be really difficult to coach a team away from data hoarding when they’ve never experienced the problems it causes. Developers are very practical—they want to see evidence of problems and aren’t going to sit through abstract discussions about data ownership and system boundaries when everything they’ve done so far has worked just fine.

In Learning Agile, Jennifer Greene and I wrote about how teams resist change because they know that what they’re doing today works. To the person trying to get developers to change, it may seem like irrational resistance, but it’s actually pretty rational to push back against someone from the outside telling them to throw out what works today for something unproven. But just like developers eventually learn that taking time for refactoring speeds them up in the long run, teams need to learn the same lesson about deliberate data design in their MCP tools.

Here are some practices that can make those discussions easier, by starting with constraints that even skeptical developers can see the value in:

  • Build tools around verbs, not nouns. Create checkEligibility() or getRecentTickets() instead of getCustomer(). Verbs force you to think about specific actions and naturally limit scope. (There’s a short sketch of this pattern right after this list.)
  • Talk about minimizing data needs. Before anyone creates an MCP tool, have a discussion about the smallest piece of data the AI needs to do its job, and about what experiments the team can run to figure out what the AI truly needs.
  • Break reads apart from reasoning. Separate data fetching from decision-making when you design your MCP tools. A simple findCustomerId() tool that returns just an ID uses minimal tokens—and might not even need to be an MCP tool at all, if a simple API call will do. Then getCustomerDetailsForRefund(id) pulls only the specific fields needed for that decision. This pattern keeps context focused and makes it obvious when someone’s trying to fetch everything.
  • Dashboard the waste. The best argument against data hoarding is showing the waste. Track the ratio of tokens fetched versus tokens used and display it in an “information radiator” style dashboard that everyone can see. When a tool pulls 5,000 tokens but the AI only references 200 in its answer, everyone can see the problem. Once developers see they’re paying for tokens they never use, they get very interested in fixing it. (A rough measurement sketch follows below as well.)
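
Here’s what the first practice can look like in code, and it naturally supports the third as well: narrow read tools leave the reasoning to the model. This is a minimal sketch using the MCP Python SDK’s FastMCP helper; the tool names, the fields, and the in-memory CUSTOMERS lookup are hypothetical stand-ins for whichever systems actually own the data.

```python
# Hypothetical sketch: verb-scoped MCP tools instead of one noun-shaped blob.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("support-tools")

# Stand-in for the CRM, billing, and ticketing systems that really own this data.
CUSTOMERS = {
    "c_123": {
        "refund_eligible": True,
        "recent_tickets": ["T-88", "T-91"],
        # ...plus the dozens of other fields a getCustomer() blob would drag along
    },
}

@mcp.tool()
def check_refund_eligibility(customer_id: str) -> bool:
    """Answer exactly one question with exactly one field."""
    return CUSTOMERS[customer_id]["refund_eligible"]

@mcp.tool()
def get_recent_tickets(customer_id: str) -> list[str]:
    """Return only the ticket IDs needed for triage, not the full threads."""
    return CUSTOMERS[customer_id]["recent_tickets"]

if __name__ == "__main__":
    mcp.run()
```

Each tool answers one question, so the context stays small, the data’s owner stays obvious, and a wrong answer is easy to trace back to the tool that produced it.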
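
And for the dashboard, a rough sketch of the measurement itself. It assumes you already log each tool’s payload alongside the model’s answer; the whitespace token count is a deliberately crude stand-in for whatever tokenizer you actually use.

```python
# Hypothetical sketch: estimate how much of a tool's payload the model actually used.
import json

def approx_tokens(text: str) -> int:
    # Crude whitespace approximation; swap in a real tokenizer for production numbers.
    return len(text.split())

def waste_ratio(tool_payload: dict, model_answer: str) -> float:
    """Fraction of fetched tokens whose field values never appear in the answer."""
    fetched = approx_tokens(json.dumps(tool_payload))
    referenced = sum(
        approx_tokens(str(value))
        for value in tool_payload.values()
        if str(value) in model_answer
    )
    return 1 - (referenced / fetched) if fetched else 0.0

payload = {"refund_eligible": True, "lifetime_orders": 42, "notes": "called twice last week"}
answer = "Yes, this customer is refund eligible: refund_eligible is True."
print(f"{waste_ratio(payload, answer):.0%} of fetched tokens went unused")
```

Plot that ratio per tool on a shared dashboard and the hoarded tools identify themselves.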

Quick smell test for data hoarding

  • Tool names are nouns (getCustomer()) instead of verbs (checkEligibility()).
  • Nobody’s ever asked, “Do we really need all these fields?”
  • You can’t tell which system owns which piece of data.
  • Debugging requires detective work across multiple data sources.
  • Your team rarely or never discusses the data design of MCP tools before building them.

Looking Forward

MCP is a simple but powerful tool with enormous potential for teams. But because it can be a critically important pillar of your entire application architecture, problems you introduce at the MCP level ripple throughout your project. Small mistakes have huge consequences down the road.

The very simplicity of MCP encourages data hoarding. It’s an easy trap to fall into, even for experienced developers. But what worries me most is that developers learning with these tools right now might never learn why data hoarding is a problem, and they won’t develop the architectural judgment that comes from having to make hard choices about data boundaries. Our job, especially as leaders and senior engineers, is to help everyone avoid the data hoarding trap.

When you treat MCP decisions with the same care you give any core interface—keeping context lean, setting boundaries, revisiting them as you learn—MCP stays what it should be: a simple, reliable bridge between your AI and the systems that power it.

Building Applications with AI Agents

12 December 2025 at 08:07

Following the publication of his new book, Building Applications with AI Agents, I chatted with author Michael Albada about his experience writing the book and his thoughts on the field of AI agents.

Michael’s a machine learning engineer with nine years of experience designing, building, and deploying large-scale machine learning solutions at companies such as Uber, ServiceNow, and more recently, Microsoft. He’s worked on recommendation systems, geospatial modeling, cybersecurity, natural language processing, large language models, and the development of large-scale multi-agent systems for cybersecurity.

What’s clear from our conversation is that writing a book on AI these days is no small feat, but for Michael, the reward of the final result was well worth the time and effort. We also discussed the writing process, the struggle of keeping up with a fast-paced field, Michael’s views on SLMs and fine-tuning, and his latest work on Autotune at Microsoft.

Here’s our conversation, edited slightly for clarity.

Nicole Butterfield: What inspired you to write this book about AI agents originally? When you initially started this endeavor, did you have any reservations?

Michael Albada: When I joined Microsoft to work in the Cybersecurity Division, I knew that organizations were facing greater speed, scale, and complexity of attacks than they could manage, and it was both expensive and difficult. There are simply not enough cybersecurity analysts on the planet to help protect all these organizations, and I was really excited about using AI to help solve that problem.

It became very clear to me that this agentic pattern of design was an exciting new way to build that was really effective—and that these language models and reasoning models, as autoregressive models, generate tokens. Those tokens can be function signatures and can call additional functions to retrieve additional information and execute tools. And it was clear to me [that they were] going to really transform the way that we were going to do a lot of work, and it was going to transform a lot of the way that we do software engineering. But when I looked around, I did not see good resources on this topic.

And so, as I was giving presentations internally at Microsoft, I realized there’s a lot of curiosity and excitement, but people had to go straight to research papers or sift through a range of blog posts. I started putting together a document that I was going to share with my team, and I realized that this was something that folks across Microsoft and even across the entire industry were going to benefit from. And so I decided to really take it up as a more comprehensive project to be able to share with the wider community.

Did you have any initial reservations about taking on writing an entire book? I mean you had a clear impetus; you saw the need. But it is your first book, right? So was there anything that you were potentially concerned about starting the endeavor?

I’ve wanted to write a book for a very long time, and very specifically, I especially enjoyed Designing Machine Learning Systems by Chip Huyen and really looked up to her as an example. I remember reading O’Reilly books earlier. I was fortunate enough to also see Tim O’Reilly give a talk at one point and just really appreciated that [act] of sharing with the larger community. Can you imagine what software engineering would look like without resources, without that type of sharing? And so I always wanted to pay that forward. 

I remember as I was first getting into computer science hoping at one point in time I would have enough knowledge and expertise to be able to write my own book. And I think that moment really surprised me, as I looked around and realized I was working on agents and running experiments and seeing these things work and seeing that no one else had written in this space. That moment to write a book seems to be right now. 

Certainly I had some doubts about whether I was ready. I had not written a book before and so that’s definitely an intimidating project. The other big doubt that I had is just how fast the field moves. And I was afraid that if I were to take the time to write a book, how relevant might it still be even by the time of publication, let alone how well is it going to stand the test of time? And I just thought hard about it and I realized that with a big design pattern shift like this, it’s going to take time for people to start designing and building these types of agentic systems. And many of the fundamentals are going to stay the same. And so the way I tried to address that is to think beyond an individual framework [or] model and really think hard about the fundamentals and the principles and write it in such a way that it’s both useful and comes along with code that people can use, but really focuses on things that’ll hopefully stand the test of time and be valuable to a wider audience for a longer period.

Yeah, you absolutely did identify an opportunity! When you approached me with the proposal, it was on my mind as well, and it was a clear opportunity. But as you said, the concern about how quickly things are moving in the field is a question that I have to ask myself about every book that we sign. And you have some experience in writing this book, adjusting to what was happening in real time. Can you talk a little bit about your writing process, taking all of these new technologies, these new concepts, and writing these into a clear narrative that is captivating to this particular audience that you targeted, at a time when everything is moving so quickly?

I initially started by drafting a full outline and just getting the sort of rough structure. And as I look back on it, that rough structure has really held from the beginning. It took me a little over a year to write the book. And my writing process was to do a basically “thinking fast and slow” approach. I wanted to go through and get a rough draft of every single chapter laid out so that I really knew sort of where I was headed, what the tricky parts were going to be, where the logic gap might be too big if someone were to skip around chapters. I wanted [to write] a book that would be enjoyable start to finish but would also serve as a valuable reference if people were to drop in on any one section. 

And to be honest, I think the changes in frameworks were much faster than I expected. When I started, LangChain was the clear leading framework, maybe followed closely by AutoGen. And now we look back on it and the focus is much more on LangGraph and CrewAI. It seemed like we might see some consolidation around a smaller number of frameworks, and instead we’ve just splintered and seen an explosion of frameworks where now Amazon has released Thread, and OpenAI has released their own [framework], and Anthropic has released their own.

So the fragmentation has only increased, which ironically underscores the approach that I took of not committing too hard to one framework but really focusing on the fundamentals that would apply across each of those. The pace of model development has been really staggering—reasoning models were just coming out as I was beginning to write this book, and that has really transformed the way we do software engineering, and it’s really increased the capabilities for these types of agentic design patterns.

So, in some ways, both more and less changed than I expected. I think the fundamentals and core content are looking more durable. I’m excited to see how that’s going to benefit people and readers going forward.

Absolutely. Absolutely. Thinking about readers, I think you may have gotten some guidance from our editorial team to really think about “Who is your ideal reader?” and focus on them as opposed to trying to reach too broad of an audience. But there are a lot of people at this moment who are interested in this topic from all different places. So I’m just wondering how you thought about your audience when you were writing?

My target audience has always been software engineers who want to increasingly use AI and build increasingly sophisticated systems, and who want to do it to solve real work and want to do this for individual projects or projects for their organizations and teams. I didn’t anticipate just how many companies were going to rebrand the work they’re doing as agents and really focus on these agentic solutions that are much more off-the-shelf. And so what I’m focused on is really understanding these patterns and learning how you can build it from the ground up. What’s exciting to see is as these models keep getting better, it’s really enabling more teams to build on this pattern.

And so I’m glad to see that there’s great tooling out there to make it easier, but I think it’s really helpful to be able to go and see how you build these things really from the model up effectively. And the other thing I’ll add is there’s a wide range of additional product managers and executives who can really benefit from understanding these systems better and how they can transform their organizations. On the other hand, we’ve also seen a real increase in excitement and use around low-code and no-code agent builders. Not only products that are off-the-shelf but also open source frameworks like Dify and n8n and the new AgentKit that OpenAI just released that really provide these types of drag-and-drop graphical interfaces. 

And of course, as I talk about in the book, agency is a spectrum: Fundamentally it’s about putting some degree of choice within the hands of a language model. And these sort of guardrailed, highly defined systems—they’re less agentic than providing a full language model with memory and with learning and with tools and potentially with self-improvement. But they still offer the opportunity for people to do very real work. 

What this book really is helpful for then is for this growing audience of low-code and no-code users to better understand how they could take those systems to the next level and translate those low-code versions into code versions. The growing use of coding models—things like Claude Code and GitHub Copilot—are just lowering the bar so dramatically to make it easier for ordinary folks who have less of a technical background to still be able to build really incredible solutions. This book can really serve [as], if not a gateway, then a really effective ramp to go from some of those early pilots and early projects onto things that are a little bit more hardened that they could actually ship to production.

So to reflect a little bit more on the process, what was one of the most formidable hurdles that you came across during the process of writing, and how did you overcome it? How do you think that ended up shaping the final book?

I think probably the most significant hurdle was just keeping up with some of the additional changes on the frameworks. Just making sure that the code that I was writing was still going to have enduring value.

As I was taking a second pass through the code I had written, some of it was already out of date. And so really continuously updating and improving and pulling to the latest models and upgrading to the latest APIs, just that underlying change that is happening. Anyone in the industry is feeling that the pace of change is increasing over time—and so really just keeping up with that. The best way that I managed that was just constant learning, following closely what was happening and making sure that I was including some of the latest research findings to ensure that it was going to be as current and as relevant as possible when it went to print so it would be as valuable as possible. 

If you could give one piece of advice to an aspiring author, what would that be?

Do it! I grew up loving books. They really have spoken to me so many times and in so many ways. And I knew that I wanted to write a book. I think many more people out there probably want to write a book than have written a book. So I would just say, you can! And please, even if your book does not do particularly well, there is an audience out there for it. Everyone has a unique perspective and a unique background and something unique to offer, and we all benefit from more of those ideas being put into print and being shared out with the larger world.

I will say, it is more work than I expected. I knew it was going to be a lot, but there’s so many drafts you want to go through. And I think as you spend time with it, it’s easy to write the first draft. It’s very hard to say this is good enough because nothing is ever perfect. Many of us have a perfectionist streak. We want to make things better. It’s very hard to say, “All right, I’m gonna stop here.” I think if you talk to many other writers, they also know their work is imperfect.

And it takes an interesting discipline to both keep putting in that work to make it as good as you possibly can and also the countervailing discipline to say this is enough, and I’m going to share this with the world and I can go and work on the next thing.

That’s a great message. Both positive and encouraging but also real, right? Just to switch gears to think a little bit more about agentic systems and where we are today: Was there anything you learned or saw or that developed about agentic systems during this process of writing the book that was really surprising or unexpected?

Honestly, it is the pace of improvement in these models. For folks who are not watching the research all that closely, it can just look like one press release after another. And especially for folks who are not based in Seattle or Silicon Valley or the hubs where this is what people are talking about and watching, it can seem like not a lot has changed since ChatGPT came out. [But] if you’re really watching the progress on these models over time, it is really impressive—the shift from supervised fine-tuning and reinforcement learning with human feedback over to reinforcement learning with verifiable rewards, and the shift to these reasoning models and recognizing that reasoning is scaling and that we need more environments and more high-quality graders. And as we keep building those out and training bigger models for longer, we’re seeing better performance over time and we can then distill that incredible performance out to smaller models. So the expectations are inflating really quickly. 

I think what’s happening is we’re judging each release against these very high expectations. And so sometimes people are disappointed with any individual release, but what we’re missing is this exponential compounding of performance that’s happening over time, where if you look back over three and six and nine and 12 months, we are seeing things change in really incredible ways. And I’d especially point to the coding models, led especially by Anthropic’s Claude, but also Codex and Gemini are really good. And even among the very best developers, the percentage of code that they are writing by hand is going down over time. It’s not that their skill or expertise is less required. It’s just that it is required to fix fewer and fewer things. This means that teams can move much much faster and build in much more efficient ways. I think we’ve seen such progress on the models and software because we have so much training data and we can build such clear verifiers and graders. And so you can just keep tuning those models on that forever.

What we’re seeing now is an extension out to additional problems in healthcare, in law, in biology, in physics. And it takes a real investment to build those additional verifiers and graders and training data. But I think we’re going to continue to see some really impressive breakthroughs across a range of different sectors. And that’s very exciting—it’s really going to transform a number of industries.

You’ve touched on others’ expectations a little bit. You speak a lot at events and give talks and so on, and you’re out there in the world learning about what people think or assume about agentic systems. Are there any common misconceptions that you’ve come across? How do you respond to or address them?

So many misconceptions. Maybe the most fundamental one is that I do see some slightly delusional thinking about considering [LLMs] to be like people. Software engineers tend to think in terms of incremental progress; we want to look for a number that we can optimize and we make it better, and that’s really how we’ve gotten here. 

One wonderful way I’ve heard [it described] is that these are thinking rocks. We are still multiplying matrices and predicting tokens. And I would just encourage folks to focus on specific problems and see how well the models work. And it will work for some things and not for others. And there’s a range of techniques that you can use to improve it, but to just take a very skeptical and empirical and pragmatic approach and use the technology and tools that we have to solve problems that people care about. 

I see a fair bit of leaping to, “Can we just have an agent diagnose all of the problems on your computer for you? Can we just get an agent to do that type of thinking?” And maybe in the distant future that will be great. But really the field is driven by smart people working hard to move the numbers just a couple points at a time, and that compounds. And so I would just encourage people to think about these as very powerful and useful tools, but fundamentally they are models that predict tokens and we can use them to solve problems, and to really think about it in that pragmatic way.

What do you see as some of the most significant current trends in the field, or even challenges?

One of the biggest open questions right now is just how much big research labs training big expensive frontier models will be able to solve these big problems in generalizable ways as opposed to this countervailing trend of more teams doing fine-tuning. Both are really powerful and effective. 

Looking back over the last 12 months, the improvements in the small models have been really staggering. Three-billion-parameter models are getting very close to what 500-billion- and trillion-parameter models were doing not that many months ago. So when you have these smaller models, it’s much more feasible for ordinary startups and Fortune 500s and potentially even small and medium-sized businesses to take some of their data and fine-tune a model to better understand their domain, their context, how that business operates. . .

That’s something that’s really valuable to many teams: to own the training pipeline and be able to customize their models and potentially customize the agents that they build on top of that and really drive those closed learning feedback loops. So now you have this agent solve this task, you collect the data from it, you grade it, and you can fine-tune the model to do that. Mira Murati’s Thinking Machines is really targeted, thinking that fine-tuning is the future. That’s a promising direction. 

But what we’ve also seen is that big models can generalize. The big research labs—OpenAI and xAI and Anthropic and Google—are certainly investing heavily in a large number of training environments and a large number of graders, and they are getting better at a broad range of tasks over time. [It’s an open question] just how much those big models will continue to improve and whether they’ll get good enough fast enough for every company. Of course, the labs will say, “Use the models by API. Just trust that they’ll get better over time and just cut us large checks for all of your use cases over time.” So, as has always been the case, if you’re a smaller company with less traffic, go and use the big providers. But if you’re someone like a Perplexity or a Cursor that has a tremendous amount of volume, it’s probably going to make sense to own your own model. The cost per inference of ownership is going to be much lower.

What I suspect is that the threshold will come down over time—that it will also make sense for medium-sized tech companies, and maybe for the Fortune 500 in various use cases, and increasingly for small and medium-sized businesses to have their own models. The healthy tension and competition between the big labs and the tooling that lets small companies own and customize their own models is going to be really interesting to watch over time, especially as the core base small models keep getting better and give you a better foundation to start from. And companies do love owning their own data and using those training ecosystems to provide a sort of differentiated intelligence and differentiated value.

You've talked a bit before about keeping up with all of these technological changes that are happening so quickly. In relation to that, how do you stay updated? You mentioned reading papers, but what resources do you personally find useful? Just so everyone out there can learn more about your process.

Yeah. One of them is just going straight to Google Scholar and arXiv. I have a couple key topics that are very interesting to me, and I search those regularly. 

LinkedIn is also fantastic. It is just fun to get connected to more people in the industry and watch the work that they're sharing and publishing. I just find that smart people share very smart things on LinkedIn—it's just an incredible feed of information. And then for all its pros and cons, X remains a really high-quality resource. It's where so many researchers are, and there are great conversations happening there. So I love those as sort of my main feeds.

To close, would you like to talk about anything interesting that you’re working on now?

I recently was part of a team that launched something that we call Autotune. Microsoft just launched pilot agents: a way you can design and configure an agent to go and automate your incident investigation and your threat hunting and help you protect your organization more easily and more safely. As part of this, we just shipped a new feature called Autotune, which will help you design and configure your agent automatically. It can also take feedback from how that agent is performing in your environment and update it over time. And we're going to continue to build on that.

There are some exciting new directions we're going in where we think we might be able to make this technology available to more people. So stay tuned for that. And then we're pushing an additional level of intelligence that combines Bayesian hyperparameter tuning with this prompt optimization, which can help with automated model selection and help configure and improve your agent as it operates in production in real time. We think this type of self-learning is going to be really valuable and is going to help more teams receive more value from the agents that they're designing and shipping.

That sounds great! Thank you, Michael.

The End of Debugging

10 December 2025 at 07:18

The following article originally appeared on Medium and is being republished here with the author’s permission.

This post is a follow-up to a post from last week on the progress of logging. A colleague pushed back on the idea that we’d soon be running code we don’t fully understand. He was skeptical: “We’ll still be the ones writing the code, right? You can only support the code if you wrote it, right?…right?”

That’s the assumption—but it’s already slipping.

You Don’t Have to Write (or Even Read) Every Line Anymore

I gave him a simple example. I needed drag-and-drop ordering in a form. I’ve built it before, but this time I asked Cursor: “Take this React component, make the rows draggable, persist the order, and generate tests.”

It did. I ran the tests, and everything passed; I then shipped the feature without ever opening the code. Not because I couldn’t but because I didn’t have to. That doesn’t mean I always ship this way. Most of the time, I still review, but it’s becoming more common that I don’t need to.

And this isn’t malpractice or vibe coding. The trust comes from two things: I know I can debug and fix if something goes wrong, and I have enough validation to know when the output is solid. If the code works, passes tests, and delivers the feature, I don’t need to micromanage every line of code. That shift is already here—and it’s only accelerating.

Already Comfortable Ceding Control

Which brings me back to site reliability. Production systems are on the same trajectory. We’re walking into a world where the software is watching itself, anticipating failures, and quietly fixing them before a human would ever notice. Consider how Airbus advises pilots to keep the autopilot on during turbulence. Computers don’t panic or overcorrect; they ride it out smoothly. That’s what’s coming for operations—systems that absorb the bumps without asking you to grab the controls.

This shift doesn’t eliminate humans, but it does change the work. We won’t be staring at charts all day, because the essential decisions won’t be visible in dashboards. Vendors like Elastic, Grafana, and Splunk won’t vanish, but they’ll need to reinvent their value in a world where the software is diagnosing and correcting itself before alerts even fire.

And this will happen faster than you think. Not because the technology matures slowly and predictably, but because the incentives are brutal: The first companies to eliminate downtime and pager duty will have an unassailable advantage, and everyone else will scramble to follow. Within a couple of years (sorry, I meant weeks), the default assumption will be that you're building for an MCP—the standard machine control plane that consumes your logs, interprets your signals, and acts on your behalf. If you're not writing for it, you'll be left behind.

More Powerful Primitives (We May Not Fully Understand)

I’ll end with this. I majored in computer engineering. I know how to design an 8-bit microprocessor on FPGAs. . .in the late 1990s. Do you think I fully understand the Apple M4 chip in the laptop I’m writing on? Conceptually, yes—I understand the principles. But I don’t know everything it’s doing, instruction by instruction. And that’s fine.

We already accept that kind of abstraction all the time. As Edsger W. Dijkstra said: “The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise.” Abstractions give us new building blocks—smaller, sharper units of thought—that let us stop worrying about every transistor and instead design at the level of processors, operating systems, or languages.

Code generation is about to redefine that building block again. It’s not just another abstraction layer; it’s a new “atom” for how we think about software. Once that shift takes hold, we’ll start leveling up—not because we know less but because we’ll be working with more powerful primitives.

Software 2.0 Means Verifiable AI

9 December 2025 at 07:23

Quantum computing (QC) and AI have one thing in common: They make mistakes.

There are two keys to handling mistakes in QC: error correction, where we've made tremendous progress in the last year, and a focus on problems where generating a solution is extremely difficult but verifying it is easy. Think about factoring a 2048-bit number (around 600 decimal digits) into its prime factors. That's a problem that would take years on a classical computer, but a quantum computer can solve it quickly—with a significant chance of an incorrect answer. So you have to test the result by multiplying the factors to see if you get the original number. Multiply two 1024-bit numbers? Easy, very easy for a modern classical computer. And if the answer's wrong, the quantum computer tries again.
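To make the generate-hard, verify-easy asymmetry concrete, here's a toy sketch in Python. The numbers are deliberately tiny stand-ins; a real check would involve a 2048-bit modulus.

    # Toy illustration of "hard to generate, easy to verify."
    # Real factoring problems involve 2048-bit numbers; these are tiny stand-ins.

    def verify_factoring(n: int, p: int, q: int) -> bool:
        """Accept a claimed factorization only if both factors are nontrivial
        and their product reproduces the original number."""
        return p > 1 and q > 1 and p * q == n

    # Suppose some oracle (a quantum computer, say) claims that 3233 = 53 * 61.
    print(verify_factoring(3233, 53, 61))  # True: one multiplication settles it
    print(verify_factoring(3233, 59, 61))  # False: reject and ask for another try

However hard producing the factors is, checking them costs one multiplication.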

One of the problems with AI is that we often shoehorn it into applications where verification is difficult. Tim Bray recently read his AI-generated biography on Grokipedia. There were some big errors, but there were also many subtle errors that no one but him would detect. We've all done the same, with one chat service or another, and all had similar results. Worse, some of the sources referenced in the biography purporting to verify claims actually "entirely fail to support the text"—a well-known problem with LLMs.

Andrej Karpathy recently proposed a definition for Software 2.0 (AI) that places verification at the center. He writes: "In this new programming paradigm then, the new most predictive feature to look at is verifiability. If a task/job is verifiable, then it is optimizable directly or via reinforcement learning, and a neural net can be trained to work extremely well."

This formulation is conceptually similar to quantum computing, though in most cases verification for AI will be much more difficult than verification for quantum computers. The minor facts of Tim Bray's life are verifiable, but what does that mean? That a verification system has to contact Tim to verify the details before authorizing a bio? Or does it mean that this kind of work should not be done by AI? Although the European Union's AI Act has laid a foundation for what AI applications should and shouldn't do, we've never had anything that's easily, well, "computable."

Furthermore: In quantum computing it's clear that if a machine fails to produce correct output, it's OK to try again. The same will be true for AI; we already know that all interesting models produce different output if you ask the question again. We shouldn't underestimate the difficulty of verification, which might prove to be more difficult than training LLMs.

Regardless of the difficulty of verification, Karpathy’s focus on verifiability is a huge step forward. Again from Karpathy: “The more a task/job is verifiable, the more amenable it is to automation…. This is what’s driving the ‘jagged’ frontier of progress in LLMs.”

What differentiates this from Software 1.0 is simple:

Software 1.0 easily automates what you can specify.
Software 2.0 easily automates what you can verify.

That’s the challenge Karpathy lays down for AI developers: determine what is verifiable and how to verify it. Quantum computing gets off easily because we only have a small number of algorithms that solve straightforward problems, like factoring large numbers. Verification for AI won’t be easy, but it will be necessary as we move into the future.

What If? AI in 2026 and Beyond

8 December 2025 at 12:58

The market is betting that AI is an unprecedented technology breakthrough, valuing Sam Altman and Jensen Huang like demigods already astride the world. The slow progress of enterprise AI adoption from pilot to production, however, still suggests at least the possibility of a less earthshaking future. Which is right?

At O’Reilly, we don’t believe in predicting the future. But we do believe you can see signs of the future in the present. Every day, news items land, and if you read them with a kind of soft focus, they slowly add up. Trends are vectors with both a magnitude and a direction, and by watching a series of data points light up those vectors, you can see possible futures taking shape.

This is how we've always identified topics to cover in our publishing program, our online learning platform, and our conferences. We watch what we call "the alpha geeks": paying attention to hackers and other early adopters of technology with the conviction that, as William Gibson put it, "The future is here, it's just not evenly distributed yet." As a great example of this today, note how the industry hangs on every word from AI pioneer Andrej Karpathy, hacker Simon Willison, and AI-for-business guru Ethan Mollick.

We are also fans of a discipline called scenario planning, which we learned decades ago during a workshop with Lawrence Wilkinson about possible futures for what is now the O’Reilly learning platform. The point of scenario planning is not to predict any future but rather to stretch your imagination in the direction of radically different futures and then to identify “robust strategies” that can survive either outcome. Scenario planners also use a version of our “watching the alpha geeks” methodology. They call it “news from the future.”

Is AI an Economic Singularity or a Normal Technology?

For AI in 2026 and beyond, we see two fundamentally different scenarios that have been competing for attention. Nearly every debate about AI, whether about jobs, about investment, about regulation, or about the shape of the economy to come, is really an argument about which of these scenarios is correct.

Scenario one: AGI is an economic singularity. AI boosters are already backing away from predictions of imminent superintelligent AI leading to a complete break with all human history, but they still envision a fast takeoff of systems capable enough to perform most cognitive work that humans do today. Not perfectly, perhaps, and not in every domain immediately, but well enough, and improving fast enough, that the economic and social consequences will be transformative within this decade. We might call this the economic singularity (to distinguish it from the more complete singularity envisioned by thinkers from John von Neumann, I. J. Good, and Vernor Vinge to Ray Kurzweil).

In this possible future, we aren’t experiencing an ordinary technology cycle. We are experiencing the start of a civilization-level discontinuity. The nature of work changes fundamentally. The question is not which jobs AI will take but which jobs it won’t. Capital’s share of economic output rises dramatically; labor’s share falls. The companies and countries that master this technology first will gain advantages that compound rapidly.

If this scenario is correct, most of the frameworks we use to think about technology adoption are wrong, or at least inadequate. The parallels to previous technology transitions such as electricity, the internet, or mobile are misleading because they suggest gradual diffusion and adaptation. What’s coming will be faster and more disruptive than anything we’ve experienced.

Scenario two: AI is a normal technology. In this scenario, articulated most clearly by Arvind Narayanan and Sayash Kapoor of Princeton, AI is a powerful and important technology but nonetheless subject to all the normal dynamics of adoption, integration, and diminishing returns. Even if we develop true AGI, adoption will still be a slow process. Like previous waves of automation, it will transform some industries, augment many workers, displace some, but most importantly, take decades to fully diffuse through the economy.

In this world, AI faces the same barriers that every enterprise technology faces: integration costs, organizational resistance, regulatory friction, security concerns, training requirements, and the stubborn complexity of real-world workflows. Impressive demos don’t translate smoothly into deployed systems. The ROI is real but incremental. The hype cycle does what hype cycles do: Expectations crash before realistic adoption begins.

If this scenario is correct, the breathless coverage and trillion-dollar valuations are symptoms of a bubble, not harbingers of transformation.

Reading News from the Future

These two scenarios lead to radically different conclusions. If AGI is an economic singularity, then massive infrastructure investment is rational, and companies borrowing hundreds of billions to spend on data centers to be used by companies that haven’t yet found a viable economic model are making prudent bets. If AI is a normal technology, that spending looks like the fiber-optic overbuild of 1999. It’s capital that will largely be written off.

If AGI is an economic singularity, then workers in knowledge professions should be preparing for fundamental career transitions; firms should be thinking how to radically rethink their products, services, and business models; and societies should be planning for disruptions to employment, taxation, and social structure that dwarf anything in living memory.

If AI is normal technology, then workers should be learning to use new tools (as they always have), but the breathless displacement predictions will join the long list of automation anxieties that never quite materialized.

So, which scenario is correct? We don’t know yet, or even if this face-off is the right framing of possible futures, but we do know that a year or two from now, we will tell ourselves that the answer was right there, in plain sight. How could we not have seen it? We weren’t reading the news from the future.

Some news is hard to miss: The change in tone of reporting in the financial markets, and perhaps more importantly, the change in tone from Sam Altman and Dario Amodei. If you follow tech closely, it’s also hard to miss news of real technical breakthroughs, and if you’re involved in the software industry, as we are, it’s hard to miss the real advances in programming tools and practices. There’s also an area that we’re particularly interested in, one which we think tells us a great deal about the future, and that is market structure, so we’re going to start there.

The Market Structure of AI

The economic singularity scenario has been framed as a winner-takes-all race for AGI that creates a massive concentration of power and wealth. The normal technology scenario suggests much more of a rising tide, where the technology platforms become dominant precisely because they create so much value for everyone else. Winners emerge over time rather than with a big bang.

Quite frankly, we have one big signal that we're watching here: Which of OpenAI, Anthropic, or Google achieves product-market fit first? By product-market fit we don't just mean that users love the product or that one company has dominant market share but that a company has found a viable economic model, where what people are willing to pay for AI-based services is greater than the cost of delivering them.

OpenAI appears to be trying to blitzscale its way to AGI, building out capacity far in excess of the company’s ability to pay for it. This is a massive one-way bet on the economic singularity scenario, which makes ordinary economics irrelevant. Sam Altman has even said that he has no idea what his business will be post-AI or what the economy will look like. So far, investors have been buying it, but doubts are beginning to shape their decisions.

Anthropic is clearly in pursuit of product-market fit, and its success in one target market, software development, is leading the company on a shorter and more plausible path to profitability. Anthropic leaders talk AGI and economic singularity, but they walk the walk of a normal technology believer. The fact that Anthropic is likely to beat OpenAI to an IPO is a very strong normal technology signal. It’s also a good example of what scenario planners view as a robust strategy, good in either scenario.

Google gives us a different take on normal technology: an incumbent looking to balance its existing business model with advances in AI. In Google’s normal technology vision, AI disappears “into the walls” like networks did. Right now, Google is still foregrounding AI with AI overviews and NotebookLM, but it’s in a position to make it recede into the background of its entire suite of products, from Search and Google Cloud to Android and Google Docs. It has too much at stake in the current economy to believe that the route to the future consists in blowing it all up. That being said, Google also has the resources to place big bets on new markets with clear economic potential, like self-driving cars, drug discovery, and even data centers in space. It’s even competing with Nvidia, not just with OpenAI and Anthropic. This is also a robust strategy.

What to watch for: What tech stack are developers and entrepreneurs building on?

Right now, Anthropic’s Claude appears to be winning that race, though that could change quickly. Developers are increasingly not locked into a proprietary stack but are easily switching based on cost or capability differences. Open standards such as MCP are gaining traction.

On the consumer side, Google Gemini is gaining on ChatGPT in terms of daily active users, and investors are starting to question OpenAI’s lack of a plausible business model to support its planned investments.

These developments suggest that the key idea behind the massive investment driving the AI boom, that one winner gets all the advantages, just doesn't hold up.

Capability Trajectories

The economic singularity scenario depends on capabilities continuing to improve rapidly. The normal technology scenario is comfortable with limits rather than hyperscaled discontinuity. There is already so much to digest!

On the economic singularity side of the ledger, positive signs would include a capability jump that surprises even insiders, such as Yann LeCun's objections being overcome: AI systems that demonstrably have world models, can reason about physics and causality, and aren't just sophisticated pattern matchers. Another game changer would be a robotics breakthrough: embodied AI that can navigate novel physical environments and perform useful manipulation tasks.

Evidence that AI is normal technology would include: AI systems that are good enough to be useful but not good enough to be trusted, continuing to require human oversight that limits productivity gains; prompt injection and security vulnerabilities remaining unsolved, constraining what agents can be trusted to do; domain complexity continuing to defeat generalization, so that what works in coding doesn't transfer to medicine, law, or science; regulatory and liability barriers proving high enough to slow adoption regardless of capability; and professional guilds successfully protecting their territory. These problems may be solved over time, but they don't just disappear with a new model release.

Regard benchmark performance with skepticism. Benchmarks are already being gamed now, while everyone is afraid of missing out, and they'll be gamed even harder once investors start losing enthusiasm.

Reports from practitioners actually deploying AI systems are far more important. Right now, tactical progress is strong. We see software developers in particular making profound changes in development workflows. Watch for whether they are seeing continued improvement or a plateau. Is the gap between demo and production narrowing or persisting? How much human oversight do deployed systems require? Listen carefully to reports from practitioners about what AI can actually do in their domain versus what it’s hyped to do.

We are not persuaded by surveys of corporate attitudes. Having lived through the realities of internet and open source software adoption, we know that, like Hemingway’s marvelous metaphor of bankruptcy, corporate adoption happens gradually, then suddenly, with late adopters often full of regret.

If AI is achieving general intelligence, though, we should see it succeed across multiple domains, not just the ones where it has obvious advantages. Coding has been the breakout application, but coding is in some ways the ideal domain for current AI. It’s characterized by well-defined problems, immediate feedback loops, formally defined languages, and massive training data. The real test is whether AI can break through in domains that are harder and farther away from the expertise of the people developing the AI models.

What to watch for: Real-world constraints start to bite. For example, what if there is not enough power to train or run the next generation of models at the scale company ambitions require? What if capital for the AI build-out dries up?

Our bet is that various real-world constraints will become more clearly recognized as limits to the adoption of AI, despite continued technical advances.

Bubble or Bust?

It’s hard not to notice how the narrative in the financial press has shifted in the past few months, from mindless acceptance of industry narratives to a growing consensus that we are in the throes of a massive investment bubble, with the chief question on everyone’s mind seeming to be when and how it will pop.

The current moment does bear uncomfortable similarities to previous technology bubbles. Famed short investor Michael Burry is comparing Nvidia to Cisco and warning of a worse crash than the dot-com bust of 2000. The circular nature of AI investment—in which Nvidia invests in OpenAI, which buys Nvidia chips; Microsoft invests in OpenAI, which pays Microsoft for Azure; and OpenAI commits to massive data center build-outs with little evidence that it will ever have enough profit to justify those commitments—has reached levels that would be comical if the numbers weren’t so large.

But there’s a counterargument: Every transformative infrastructure build-out begins with a bubble. The railroads of the 1840s, the electrical grid of the 1900s, the fiber-optic networks of the 1990s all involved speculative excess, but all left behind infrastructure that powered decades of subsequent growth. One question is whether AI infrastructure is like the dot-com bubble (which left behind useful fiber and data centers) or the housing bubble (which left behind empty subdivisions and a financial crisis).

The real question when faced with a bubble is this: What will be the source of value in what is left? It most likely won't be in the AI chips, which have a short useful life. It may not even be in the data centers themselves. It may be in a new approach to programming that unlocks entirely new classes of applications. But one pretty good bet is that there will be enduring value in the energy infrastructure build-out. Given the Trump administration's war on renewable energy, the market demand for energy in the AI build-out may be its saving grace. A future of abundant, cheap energy rather than the current fight for access that drives up prices for consumers could be a very nice outcome.

Signs pointing toward economic singularity: Widespread job losses across multiple industries and a spiking business bankruptcy rate; storied companies wiped out by major new applications that just couldn't exist without AI; sustained high utilization of AI infrastructure (data centers, GPU clusters) over multiple years; actual demand meeting or exceeding capacity; continued spiking of energy prices, especially in areas with many data centers.

Signs pointing toward bubble: Continued reliance on circular financing structures (vendor financing, equity swaps between AI companies); enterprise AI projects stall in the pilot phase, failing to scale; a “show me the money” moment arrives, where investors demand profitability and AI companies can’t deliver.

Signs pointing toward a normal technology recovery postbubble: Strong revenue growth at AI application companies, not just infrastructure providers; enterprises reporting concrete, measurable ROI from AI deployments.

What to watch: There are so many possibilities that this is an act of imagination! Start with Wile E. Coyote running over a cliff in pursuit of Road Runner in the classic Warner Bros. cartoons. Imagine the moment when investors realize that they are trying to defy gravity.

[Image: Going over a cliff. Generated with Gemini and Nano Banana Pro.]

What made them notice? Was it the failure of a much-hyped data center project? Was it that it couldn’t get financing, that it couldn’t get completed because of regulatory constraints, that it couldn’t get enough chips, that it couldn’t get enough power, that it couldn’t get enough customers?

Imagine one or more storied AI labs or startups unable to complete their next fundraise. Imagine Oracle or SoftBank trying to get out of a big capital commitment. Imagine Nvidia announcing a revenue miss. Imagine another DeepSeek moment coming out of China.

Our bet for the most likely pin to pop the bubble is that Anthropic's and Google's success against OpenAI persuades investors that OpenAI will not be able to pay for the massive amount of data center capacity it has contracted for. Given the company's centrality to the AGI singularity narrative, a failure of belief in OpenAI could bring down the whole web of interconnected data center bets, many of them financed by debt. But that's not the only possibility.

Always Update Your Priors

DeepSeek’s emergence in January was a signal that the American AI establishment may not have the commanding lead it assumed. Rather than racing for AGI, China seems to be heavily betting on normal technology, building towards low-cost, efficient AI, industrial capacity, and clear markets. While claims about what DeepSeek spent on training its V3 model have been contested, training isn’t the only cost: There’s also the cost of inference and, for increasingly popular reasoning models, the cost of reasoning. And when these are taken into account, DeepSeek is very much a leader.

If DeepSeek and other Chinese AI labs are right, the US may be intent on winning the wrong race. What's more, our conversations with Chinese AI investors reveal a much heavier tilt towards embodied AI (robotics and all its cousins) than towards consumer or even enterprise applications. Given the geopolitical tensions between China and the US, it's worth asking what kind of advantage a GPT-9 with limited access to the real world might provide against an army of drones and robots powered by the equivalent of GPT-8!

The point is that the discussion above is meant to be provocative, not exhaustive. Expand your horizons. Think about how US and international politics, advances in other technologies, and financial market impacts ranging from a massive market collapse to a simple change in investor priorities might change industry dynamics.

What you’re watching for is not any single data point but the pattern across multiple vectors over time. Remember that the AGI versus normal technology framing is not the only or maybe even the most useful way to look at the future.

The most likely outcome, even restricted to these two hypothetical scenarios, is something in between. AI may achieve something like AGI for coding, text, and video while remaining a normal technology for embodied tasks and complex reasoning. It may transform some industries rapidly while others resist for decades. The world is rarely as neat as any scenario.

But that’s precisely why the “news from the future” approach matters. Rather than committing to a single prediction, you stay alert to the signals, ready to update your thinking as evidence accumulates. You don’t need to know which scenario is correct today. You need to recognize which scenario is becoming correct as it happens.

[Infographic: AI in 2026 and Beyond. Created with Gemini and Nano Banana Pro.]

What If? Robust Strategies in the Face of Uncertainty

The second part of scenario planning is to identify robust strategies that will help you do well regardless of which possible future unfolds. In this final section, as a way of making clear what we mean by that, we’ll consider 10 “What if?” questions and ask what the robust strategies might be.

1. What if the AI bubble bursts in 2026?

The vector: We are seeing massive funding rounds for AI foundries and massive capital expenditure on GPUs and data centers without a corresponding explosion in revenue for the application layer.

The scenario: The “revenue gap” becomes undeniable. Wall Street loses patience. Valuations for foundational model companies collapse and the river of cheap venture capital dries up.

In this scenario, we would see responses like OpenAI’s “Code Red” reaction to improvements in competing products. We would see declines in prices for stocks that aren’t yet traded publicly. And we might see signs that the massive fundraising for data centers and power are performative, not backed by real capital. In the words of one commenter, they are “bragawatts.”

A robust strategy: Don’t build a business model that relies on subsidized intelligence. If your margins only work because VC money is paying for 40% of your inference costs, you are vulnerable. Focus on unit economics. Build products where the AI adds value that customers are willing to pay for now, not in a theoretical future where AI does everything. If the bubble bursts, infrastructure will remain, just as the dark fiber did, becoming cheaper for the survivors to use.

2. What if energy becomes the hard limit?

The vector: Data centers are already stressing grids. We are seeing a shift from the AI equivalent of Moore’s law to a world where progress may be limited by energy constraints.

The scenario: In 2026, we hit a wall. Utilities simply cannot provision power fast enough. Inference becomes a scarce resource, available only to the highest bidders or those with private nuclear reactors. Highly touted data center projects are put on hold because there isn’t enough power to run them, and rapidly depreciating GPUs are put in storage because there aren’t enough data centers to deploy them.

A robust strategy: Efficiency is your hedge. Stop treating compute as infinite. Invest in small language models (SLMs) and edge AI that run locally. If you can run 80% of your workload on a laptop-grade chip rather than an H100 in the cloud, you are at least partially insulated from the energy crunch.

3. What if inference becomes a commodity?

The vector: Chinese labs continue to release open weight models whose performance is comparable to the previous generation of top-of-the-line US frontier models, at a fraction of the training and inference cost. What's more, they are training them with lower-cost chips. And it appears to be working.

The scenario: The price of “intelligence” collapses to near zero. The moat of having the biggest model and the best cutting-edge chips for training evaporates.

A robust strategy: Move up the stack. If the model is a commodity, the value is in the integration, the data, and the workflow. Build applications and services using the unique data, context, and workflows that no one else has.

4. What if Yann LeCun is right?

The vector: LeCun has long argued that auto-regressive LLMs are an “off-ramp” on the highway to AGI because they can’t reason or plan; they only predict the next token. He bets on world models (JEPA). OpenAI cofounder Ilya Sutskever has also argued that the AI industry needs fundamental research to solve basic problems like the ability to generalize.

The scenario: In 2026, LLMs hit a plateau. The market realizes we've spent billions on a dead-end technology for true AGI.

A robust strategy: Diversify your architecture. Don’t bet the farm on today’s AI. Focus on compound AI systems that use LLMs as just one component, while relying on deterministic code, databases, and small, specialized models for additional capabilities. Keep your eyes and your options open.

5. What if there is a major security incident?

The vector: We are currently hooking insecure LLMs up to banking APIs, email, and purchasing agents. Security researchers have been screaming about indirect prompt injection for years.

The scenario: A worm spreads through email auto-replies, tricking AI agents into transferring funds or approving fraudulent invoices at scale. Trust in agentic AI collapses.

A robust strategy: “Trust but verify” is dead; use “verify then trust.” Implement well-known security practices like least privilege (restrict your agents to the minimal list of resources they need) and zero trust (require authentication before every action). Stay on top of OWASP’s lists of AI vulnerabilities and mitigations. Keep a “human in the loop” for high-stakes actions. Advocate for and adopt standard AI disclosure and audit trails. If you can’t trace why your agent did something, you shouldn’t let it handle money.

6. What if China is actually ahead?

The vector: While the US focuses on raw scale and chip export bans, China is focusing on efficiency and embedded AI in manufacturing, EVs, and consumer hardware.

The scenario: We discover that 2026’s “iPhone moment” comes from Shenzhen, not Cupertino, because Chinese companies integrated AI into hardware better while we were fighting over chatbot and agentic AI dominance.

A robust strategy: Look globally. Don’t let geopolitical narratives blind you to technical innovation. If the best open source models or efficiency techniques are coming from China, study them. Open source has always been the best way to bridge geopolitical divides. Keep your stack compatible with the global ecosystem, not just the US silo.

7. What if robotics has its “ChatGPT moment”?

The vector: End-to-end learning for robots is advancing rapidly.

The scenario: Suddenly, physical labor automation becomes as possible as digital automation.

A robust strategy: If you are in a “bits” business, ask how you can bridge to “atoms.” Can your software control a machine? How might you embody useful intelligence into your products?

8. What if vibe coding is just the start?

The vector: Anthropic and Cursor are changing programming from writing syntax to managing logic and workflow. Vibe coding lets nonprogrammers build apps by just describing what they want.

The scenario: The barrier to entry for software creation drops to zero. We see a Cambrian explosion of apps built for a single meeting or a single family vacation. Alex Komoroske calls it disposable software: “Less like canned vegetables and more like a personal farmer’s market.”

A robust strategy: In a world where AI is good enough to generate whatever code we ask for, value shifts to knowing what to ask for. Coding is much like writing: Anyone can do it, but some people have more to say than others. Programming isn’t just about writing code; it’s about understanding problems, contexts, organizations, and even organizational politics to come up with a solution. Create systems and tools that embody unique knowledge and context that others can use to solve their own problems.

9. What if AI kills the aggregator business model?

The vector: Amazon and Google make money by being the tollbooth between you and the product or information you want. If people get answers from AI, or an AI agent buys for you, it bypasses the ads and the sponsored listings, undermining the business model of internet incumbents.

The scenario: Search traffic (and ad revenue) plummets. Brands lose their ability to influence consumers via display ads. AI has destroyed the source of internet monetization and hasn’t yet figured out what will take its place.

A robust strategy: Own the customer relationship directly. If Google stops sending you traffic, you need an MCP, an API, or a channel for direct brand loyalty that an AI agent respects. Make sure your information is accessible to bots, not just humans. Optimize for agent readability and reuse.

10. What if a political backlash arrives?

The vector: The divide between the AI rich and those who fear being replaced by AI is growing.

The scenario: A populist movement targets Big Tech and AI automation. We see taxes on compute, robot taxes, or strict liability laws for AI errors.

A robust strategy: Focus on value creation, not value capture. If your AI strategy is “fire 50% of the support staff,” you are not only making a shortsighted business decision; you are painting a target on your back. If your strategy is “supercharge our staff to do things we couldn’t do before,” you are building a defensible future. Align your success with the success of both your workers and customers.

In Conclusion

The future isn’t something that happens to us; it’s something we create. The most robust strategy of all is to stop asking “What will happen?” and start asking “What future do we want to build?”

As Alan Kay once said, “The best way to predict the future is to invent it.” Don’t wait for the AI future to happen to you. Do what you can to shape it. Build the future you want to live in.

Software in the Age of AI

4 December 2025 at 07:19

In 2025 AI reshaped how teams think, build, and deliver software. We're now at a point where "AI coding assistants have quickly moved from novelty to necessity [with] up to 90% of software engineers us[ing] some kind of AI for coding," Addy Osmani writes. That's a very different world from the one we were in 12 months ago. As we look ahead to 2026, here are three key trends we have seen driving change and how we think developers and architects can prepare for what's ahead.

Evolving Coding Workflows

New AI tools changed coding workflows in 2025, enabling developers to write and work with code faster than ever before. This doesn’t mean AI is replacing developers. It’s opening up new frontiers to be explored and skills to be mastered, something we explored at our first AI Codecon in May.

AI tools in the IDE and on the command line have revived the debate about the IDE’s future, echoing past arguments (e.g., VS Code versus Vim). It’s more useful to focus on the tools’ purpose. As Kent Beck and Tim O’Reilly discussed in November, developers are ultimately responsible for the code their chosen AI tool produces. We know that LLMs “actively reward existing top tier software engineering practices” and “amplify existing expertise,” as Simon Willison has pointed out. And a good coder will “factor in” questions that AI doesn’t. Does it really matter which tool is used?

The critical transferable skill for working with any of these tools is understanding how to communicate effectively with the underlying model. AI tools generate better code if they’re given all the relevant background on a project. Managing what the AI knows about your project (context engineering) and communicating it (prompt engineering) are going to be key to doing good work.

The core skills for working effectively with code won’t change in the face of AI. Understanding code review, design patterns, debugging, testing, and documentation and applying those to the work you do with AI tools will be the differential.

The Rise of Agentic AI

With the rise of agents and Model Context Protocol (MCP) in the second half of 2025, developers gained the ability to use AI not just as a pair programmer but as an entire team of developers. The speakers at our Coding for the Agentic World live AI Codecon event in September 2025 explored new tools, workflows, and hacks that are shaping this emerging discipline of agentic AI.

Software engineers aren’t just working with single coding agents. They’re building and deploying their own custom agents, often within complex setups involving multi-agent scenarios, teams of coding agents, and agent swarms. This shift from conducting AI to orchestrating AI elevates the importance of truly understanding how good software is built and maintained.

We know that AI generates better code with context, and this is also true of agents. As with coding workflows, this means understanding context engineering is essential. However, the differential for senior engineers in 2026 will be how well they apply intermediate skills such as product thinking, advanced testing, system design, and architecture to their work with agentic systems.

AI and Software Architecture

We began 2025 with our January Superstream, Software Architecture in the Age of AI, where speaker Rebecca Parsons explored the architectural implications of AI, dryly noting that “given the pace of change, this could be out of date by Friday.” By the time of our Superstream in August, things had solidified a little more and our speakers were able to share AI-based patterns and antipatterns and explain how they intersect with software architecture. Our December 9 event will look at enterprise architecture and how architects can navigate the impact of AI on systems, processes, and governance. (Registration is still open—save your seat.) As these events show, AI has progressed from being something architects might have to consider to something that is now essential to their work.

We’re seeing successful AI-enhanced architectures using event-driven models, enabling AI agents to act on incoming triggers rather than fixed prompts. This means it’s more important than ever to understand event-driven architecture concepts and trade-offs. In 2026, topics that align with evolving architectures (evolutionary architectures, fitness functions) will also become more important as architects look to find ways to modernize existing systems for AI without derailing them. AI-native architectures will also bring new considerations and patterns for system design next year, as will the trend toward agentic AI.

As was the case for their engineer coworkers, architects still have to know the basics: when to add an agent or a microservice, how to consider cost, how to define boundaries, and how to act on the knowledge they already have. As Thomas Betts, Sarah Wells, Eran Stiller, and Daniel Bryant note on InfoQ, they also “nee[d] to understand how an AI element relates to other parts of their system: What are the inputs and outputs? How can they measure performance, scalability, cost, and other cross-functional requirements?”

Companies will continue to decentralize responsibilities across different functions this year, and AI brings new sets of trade-offs to be considered. It’s true that regulated industries remain understandably wary of granting access to their systems. They’re rolling out AI more carefully with greater guardrails and governance, but they are still rolling it out. So there’s never been a better time to understand the foundations of software architecture. It will prepare you for the complexity on the horizon.

Strong Foundations Matter

AI has changed the way software is built, but it hasn’t changed what makes good software. As we enter 2026, the most important developer and architecture skills won’t be defined by the tool you know. They’ll be defined by how effectively you apply judgment, communicate intent, and handle complexity when working with (and sometimes against) intelligent assistants and agents. AI rewards strong engineering; it doesn’t replace it. It’s an exciting time to be involved.


Join us at the Software Architecture Superstream on December 9 to learn how to better navigate the impact of AI on systems, processes, and governance. Over four hours, host Neal Ford and our lineup of experts including Metro Bank’s Anjali Jain and Philip O’Shaughnessy, Vercel’s Dom Sipowicz, Intel’s Brian Rogers, Microsoft’s Ron Abellera, and Equal Experts’ Lewis Crawford will share their hard-won insights about building adaptive, AI-ready architectures that support continuous innovation, ensure governance and security, and align seamlessly with business goals.

O’Reilly members can register here. Not a member? Sign up for a 10-day free trial before the event to attend—and explore all the other resources on O’Reilly.

AI Agents Need Guardrails

3 December 2025 at 07:13

When AI systems were just a single model behind an API, life felt simpler. You trained, deployed, and maybe fine-tuned a few hyperparameters.

But that world’s gone. Today, AI feels less like a single engine and more like a busy city—a network of small, specialized agents constantly talking to each other, calling APIs, automating workflows, and making decisions faster than humans can even follow.

And here’s the real challenge: The smarter and more independent these agents get, the harder it becomes to stay in control. Performance isn’t what slows us down anymore. Governance is.

How do we make sure these agents act ethically, safely, and within policy? How do we log what happened when multiple agents collaborate? How do we trace who decided what in an AI-driven workflow that touches user data, APIs, and financial transactions?

That’s where the idea of engineering governance into the stack comes in. Instead of treating governance as paperwork at the end of a project, we can build it into the architecture itself.

From Model Pipelines to Agent Ecosystems

In the old days of machine learning, things were pretty linear. You had a clear pipeline: collect data, train the model, validate it, deploy, monitor. Each stage had its tools and dashboards, and everyone knew where to look when something broke.

But with AI agents, that neat pipeline turns into a web. A single customer-service agent might call a summarization agent, which then asks a retrieval agent for context, which in turn queries an internal API—all happening asynchronously, sometimes across different systems.

It’s less like a pipeline now and more like a network of tiny brains, all thinking and talking at once. And that changes how we debug, audit, and govern. When an agent accidentally sends confidential data to the wrong API, you can’t just check one log file anymore. You need to trace the whole story: which agent called which, what data moved where, and why each decision was made. In other words, you need full lineage, context, and intent tracing across the entire ecosystem.

Why Governance Is the Missing Layer

Governance in AI isn’t new. We already have frameworks like NIST’s AI Risk Management Framework (AI RMF) and the EU AI Act defining principles like transparency, fairness, and accountability. The problem is these frameworks often stay at the policy level, while engineers work at the pipeline level. The two worlds rarely meet. In practice, that means teams might comply on paper but have no real mechanism for enforcement inside their systems.

What we really need is a bridge—a way to turn those high-level principles into something that runs alongside the code, testing and verifying behavior in real time. Governance shouldn’t be another checklist or approval form; it should be a runtime layer that sits next to your AI agents—ensuring every action follows approved paths, every dataset stays where it belongs, and every decision can be traced when something goes wrong.

The Four Guardrails of Agent Governance

Policy as code

Policies shouldn’t live in forgotten PDFs or static policy docs. They should live next to your code. By using tools like the Open Policy Agent (OPA), you can turn rules into version-controlled code that’s reviewable, testable, and enforceable. Think of it like writing infrastructure as code, but for ethics and compliance. You can define rules such as:

  • Which agents can access sensitive datasets
  • Which API calls require human review
  • When a workflow needs to stop because the risk feels too high

This way, developers and compliance folks stop talking past each other—they work in the same repo, speaking the same language.

And the best part? You can spin up a Dockerized OPA instance right next to your AI agents inside your Kubernetes cluster. It just sits there quietly, watching requests, checking rules, and blocking anything risky before it hits your APIs or data stores.
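As a simplified illustration, here's a minimal sketch of what that check might look like from the agent side. It assumes an OPA instance listening on localhost:8181 and a hypothetical Rego package named agents.authz that defines an allow rule over this input shape; those names are placeholders, not part of OPA itself.

    # A minimal sketch of an agent-side policy check against a local OPA instance.
    # Assumes OPA is on localhost:8181 and a (hypothetical) Rego package
    # "agents.authz" defines an "allow" rule over this input shape.
    import requests

    OPA_URL = "http://localhost:8181/v1/data/agents/authz/allow"

    def is_allowed(agent: str, action: str, dataset: str) -> bool:
        """Ask OPA whether this agent may perform this action on this dataset."""
        payload = {"input": {"agent": agent, "action": action, "dataset": dataset}}
        response = requests.post(OPA_URL, json=payload, timeout=2)
        response.raise_for_status()
        # OPA's Data API wraps the rule's value in a "result" field;
        # treat a missing result (undefined rule) as a deny.
        return response.json().get("result", False) is True

    if not is_allowed("FinanceBot", "read", "customer_pii"):
        raise PermissionError("Blocked by policy before the API call was made")

The rule logic itself stays in Rego files under version control; the agent only ever asks for a yes or no.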

Governance stops being some scary afterthought. It becomes just another microservice. Scalable. Observable. Testable. Like everything else that matters.

Observability and auditability

Agents need to be observable not just in performance terms (latency, errors) but in decision terms. When an agent chain executes, we should be able to answer:

  • Who initiated the action?
  • What tools were used?
  • What data was accessed?
  • What output was generated?

Modern observability stacks—Cloud Logging, OpenTelemetry, Prometheus, or Grafana Loki—can already capture structured logs and traces. What’s missing is semantic context: linking actions to intent and policy.

Imagine extending your logs to capture not only “API called” but also “Agent FinanceBot requested API X under policy Y with risk score 0.7.” That’s the kind of metadata that turns telemetry into governance.
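Here's a small sketch of what emitting that kind of decision log could look like with Python's standard logging module. The field names (agent, policy, risk_score, correlation_id) are illustrative, not a standard schema.

    # A sketch of a "decision log": the event you'd already log, plus the
    # governance metadata that lets you reconstruct who decided what.
    import json
    import logging
    import uuid

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("governance")

    def log_decision(agent: str, api: str, policy: str,
                     risk_score: float, allowed: bool) -> str:
        correlation_id = str(uuid.uuid4())  # ties this call to traces and audits
        log.info(json.dumps({
            "event": "agent_api_call",
            "agent": agent,
            "api": api,
            "policy": policy,
            "risk_score": risk_score,
            "allowed": allowed,
            "correlation_id": correlation_id,
        }))
        return correlation_id

    log_decision("FinanceBot", "payments/v1/refund", "finance-approvals-v2", 0.7, True)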

When your system runs in Kubernetes, sidecar containers can automatically inject this metadata into every request, creating a governance trace as natural as network telemetry.

Dynamic risk scoring

Governance shouldn’t mean blocking everything; it should mean evaluating risk intelligently. In an agent network, different actions have different implications. A “summarize report” request is low risk. A “transfer funds” or “delete records” request is high risk.

By assigning dynamic risk scores to actions, you can decide in real time whether to:

  • Allow it automatically
  • Require additional verification
  • Escalate to a human reviewer

You can compute risk scores using metadata such as agent role, data sensitivity, and confidence level. Managed services like Google Cloud's Vertex AI Model Monitoring already support risk tagging and drift detection—you can extend those ideas to agent actions.

The point isn’t to slow agents down but to make their behavior context-aware.
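A sketch of what such a scoring function might look like is below. The weights and thresholds are invented for illustration; in practice you would calibrate them against your own incidents and policies.

    # A sketch of dynamic risk scoring. Weights and thresholds are invented
    # for illustration and would need calibration against real incidents.
    from dataclasses import dataclass

    SENSITIVITY = {"public": 0.1, "internal": 0.4, "confidential": 0.8}
    HIGH_RISK_ACTIONS = {"transfer_funds", "delete_records"}

    @dataclass
    class AgentAction:
        agent_role: str          # e.g., "support", "finance"
        action: str              # e.g., "summarize_report", "transfer_funds"
        data_sensitivity: str    # "public" | "internal" | "confidential"
        model_confidence: float  # 0.0 - 1.0

    def risk_score(a: AgentAction) -> float:
        score = SENSITIVITY.get(a.data_sensitivity, 0.5)
        if a.action in HIGH_RISK_ACTIONS:
            score += 0.4
        score += (1.0 - a.model_confidence) * 0.2  # low confidence raises risk
        return min(score, 1.0)

    def decide(a: AgentAction) -> str:
        score = risk_score(a)
        if score < 0.3:
            return "allow"              # low risk: proceed automatically
        if score < 0.7:
            return "verify"             # medium risk: require extra checks
        return "escalate_to_human"      # high risk: a person signs off

    print(decide(AgentAction("support", "summarize_report", "public", 0.92)))      # allow
    print(decide(AgentAction("finance", "transfer_funds", "confidential", 0.95)))  # escalate_to_human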

Regulatory mapping

Frameworks like NIST AI RMF and the EU AI Act are often seen as legal mandates. In reality, they can double as engineering blueprints. Each governance principle maps to an engineering implementation:

  • Transparency: Agent activity logs, explainability metadata
  • Accountability: Immutable audit trails in Cloud Logging/Chronicle
  • Robustness: Canary testing, rollout control in Kubernetes
  • Risk management: Real-time scoring, human-in-the-loop review

Mapping these requirements into cloud and container tools turns compliance into configuration.

Once you start thinking of governance as a runtime layer, the next step is to design what that actually looks like in production.

Building a Governed AI Stack

Let’s visualize a practical, cloud native setup—something you could deploy tomorrow.

[Agent Layer]

[Governance Layer]
→ Policy Engine (OPA)
→ Risk Scoring Service
→ Audit Logger (Pub/Sub + Cloud Logging)

[Tool / API Layer]
→ Internal APIs, Databases, External Services

[Monitoring + Dashboard Layer]
→ Grafana, BigQuery, Looker, Chronicle

All of these can run on Kubernetes with Docker containers for modularity. The governance layer acts as a smart proxy—it intercepts agent calls, evaluates policy and risk, then logs and forwards the request if approved.
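To make the proxy idea concrete, here's a compact sketch of the intercept, evaluate, log, and forward flow. The policy check, risk scorer, and audit logger are stubbed out; in the stack above they would be the OPA service, the risk-scoring service, and Pub/Sub plus Cloud Logging.

    # A compact sketch of the governance layer as a smart proxy.
    from typing import Any, Callable

    def check_policy(request: dict) -> bool:
        return request.get("action") != "delete_records"   # stand-in for an OPA call

    def score_risk(request: dict) -> float:
        return 0.9 if request.get("action") == "transfer_funds" else 0.2

    def audit(request: dict, allowed: bool, risk: float) -> None:
        print(f"AUDIT agent={request['agent']} action={request['action']} "
              f"allowed={allowed} risk={risk}")

    def governed_call(request: dict, forward: Callable[[dict], Any]) -> Any:
        """Intercept an agent's call: evaluate policy and risk, log, then forward."""
        allowed = check_policy(request)
        risk = score_risk(request)
        if risk >= 0.8:
            allowed = False  # high-risk calls wait for human review instead
        audit(request, allowed, risk)
        if not allowed:
            raise PermissionError("Request blocked or escalated by governance layer")
        return forward(request)

    # The tool/API layer only ever sees requests that passed governance.
    governed_call({"agent": "SupportBot", "action": "summarize_report"},
                  forward=lambda req: f"calling internal API for {req['action']}")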

In practice:

  • Each agent’s container registers itself with the governance service.
  • Policies live in Git, deployed as ConfigMaps or sidecar containers.
  • Logs flow into Cloud Logging or Elastic Stack for searchable audit trails.
  • A Chronicle or BigQuery dashboard visualizes high-risk agent activity.

This separation of concerns keeps things clean: Developers focus on agent logic, security teams manage policy rules, and compliance officers monitor dashboards instead of sifting through raw logs. It’s governance you can actually operate—not bureaucracy you try to remember later.

Lessons from the Field

When I started integrating governance layers into multi-agent pipelines, I learned three things quickly:

  1. It’s not about more controls—it’s about smarter controls.
    When all operations have to be manually approved, you will paralyze your agents. Focus on automating the 90% that’s low risk.
  2. Logging everything isn’t enough.
    Governance requires interpretable logs. You need correlation IDs, metadata, and summaries that map events back to business rules.
  3. Governance has to be part of the developer experience.
    If compliance feels like a gatekeeper, developers will route around it. If it feels like a built-in service, they’ll use it willingly.

In one real-world deployment for a financial-tech environment, we used a Kubernetes admission controller to enforce policy before pods could interact with sensitive APIs. Each request was tagged with a “risk context” label that traveled through the observability stack. The result? Governance without friction. Developers barely noticed it—until the compliance audit, when everything just worked.

Human in the Loop, by Design

Despite all the automation, people should still be involved in some decisions. A healthy governance stack knows when to ask for help. Imagine a risk-scoring service that occasionally flags "Agent Alpha has exceeded the transaction threshold three times today." Instead of blocking outright, it can forward the request to a human operator via Slack or an internal dashboard. When an automated system knows it needs a person to review a decision, that's not a weakness; it's a sign of maturity. Reliable AI doesn't mean eliminating people; it means knowing when to bring them back in.

Avoiding Governance Theater

Every company wants to say they have AI governance. But there’s a difference between governance theater—policies written but never enforced—and governance engineering—policies turned into running code.

Governance theater produces binders. Governance engineering produces metrics:

  • Percentage of agent actions logged
  • Number of policy violations caught pre-execution
  • Average human review time for high-risk actions

When you can measure governance, you can improve it. That’s how you move from pretending to protect systems to proving that you do. The future of AI isn’t just about building smarter models; it’s about building smarter guardrails. Governance isn’t bureaucracy—it’s infrastructure for trust. And just as we’ve made automated testing part of every CI/CD pipeline, we’ll soon treat governance checks the same way: built in, versioned, and continuously improved.

True progress in AI doesn’t come from slowing down. It comes from giving it direction, so innovation moves fast but never loses sight of what’s right.

What MCP and Claude Skills Teach Us About Open Source for AI

3 December 2025 at 03:58

The debate about open source AI has largely featured open weight models. But that’s a bit like arguing that in the PC era, the most important goal would have been to have Intel open source its chip designs. That might have been useful to some people, but it wouldn’t have created Linux, Apache, or the collaborative software ecosystem that powers the modern internet. What makes open source transformative is the ease with which people can learn from what others have done, modify it to meet their own needs, and share those modifications with others. And that can’t just happen at the lowest, most complex level of a system. And it doesn’t come easily when what you are providing is access to a system that takes enormous resources to modify, use, and redistribute. It comes from what I’ve called the architecture of participation.

This architecture of participation has a few key properties:

  • Legibility: You can understand what a component does without understanding the whole system.
  • Modifiability: You can change one piece without rewriting everything.
  • Composability: Pieces work together through simple, well-defined interfaces.
  • Shareability: Your small contribution can be useful to others without them adopting your entire stack.

The most successful open source projects are built from small pieces that work together. Unix gave us a small operating system kernel surrounded by a library of useful functions, together with command-line utilities that could be chained together with pipes and combined into simple programs using the shell. Linux followed and extended that pattern. The web gave us HTML pages you could “view source” on, letting anyone see exactly how a feature was implemented and adapt it to their needs, and HTTP connected every website as a linkable component of a larger whole. Apache didn’t beat Netscape and Microsoft in the web server market by adding more and more features, but instead provided an extension layer so a community of independent developers could add frameworks like Grails, Kafka, and Spark.

MCP and Skills Are “View Source” for AI

MCP and Claude Skills remind me of those early days of Unix/Linux and the web. MCP lets you write small servers that give AI systems new capabilities such as access to your database, your development tools, your internal APIs, or third-party services like GitHub, GitLab, or Stripe. A skill is even more atomic: a set of plain language instructions, often with some tools and resources, that teaches Claude how to do something specific. Matt Bell from Anthropic remarked in comments on a draft of this piece that a skill can be defined as “the bundle of expertise to do a task, and is typically a combination of instructions, code, knowledge, and reference materials.” Perfect.

What is striking about both is their ease of contribution. You write something that looks like the shell scripts and web APIs developers have been writing for decades. If you can write a Python function or format a Markdown file, you can participate.
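
To make that concrete, here is roughly what a tiny MCP server looks like using the FastMCP helper from the official Python SDK; the lookup_order tool is invented for illustration.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("orders")

@mcp.tool()
def lookup_order(order_id: str) -> str:
    """Return the status of an order from an internal system (stubbed here)."""
    return f"Order {order_id}: shipped"

if __name__ == "__main__":
    mcp.run()

That’s essentially the whole server: a decorated function and a run loop that any MCP-aware client can discover and call.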

This is the same quality that made the early web explode. When someone created a clever navigation menu or form validation, you could view source, copy their HTML and JavaScript, and adapt it to your site. You learned by doing, by remixing, by seeing patterns repeated across sites you admired. You didn’t have to be an Apache contributor to get the benefit of learning from others and reusing their work.

Anthropic’s MCP Registry and third-party directories like punkpeye/awesome-mcp-servers show early signs of this same dynamic. Someone writes an MCP server for Postgres, and suddenly dozens of AI applications gain database capabilities. Someone creates a skill for analyzing spreadsheets in a particular way, and others fork it, modify it, and share their versions. Anthropic still seems to be feeling its way with user contributed skills, listing in its skills gallery only those they and select partners have created, but they document how to create them, making it possible for anyone to build a reusable tool based on their specific needs, knowledge, or insights. So users are developing skills that make Claude more capable and sharing them via GitHub. It will be very exciting to see how this develops. Groups of developers with shared interests creating and sharing collections of interrelated skills and MCP servers that give models deep expertise in a particular domain will be a potent frontier for both AI and open source.

GPTs Versus Skills: Two Models of Extension

It’s worth contrasting the MCP and skills approach with OpenAI’s custom GPTs, which represent a different vision of how to extend AI capabilities.

GPTs are closer to apps. You create one by having a conversation with ChatGPT, giving it instructions and uploading files. The result is a packaged experience. You can use a GPT or share it for others to use, but they can’t easily see how it works, fork it, or remix pieces of it into their own projects. GPTs live in OpenAI’s store, discoverable and usable but ultimately contained within the OpenAI ecosystem.

This is a valid approach, and for many use cases, it may be the right one. It’s user-friendly. If you want to create a specialized assistant for your team or customers, GPTs make that straightforward.

But GPTs aren’t participatory in the open source sense. You can’t “view source” on someone’s GPT to understand how they got it to work well. You can’t take the prompt engineering from one GPT and combine it with the file handling from another. You can’t easily version control GPTs, diff them, or collaborate on them the way developers do with code. (OpenAI offers team plans that do allow collaboration by a small group using the same workspace, but this is a far cry from open source–style collaboration.)

Skills and MCP servers, by contrast, are files and code. A skill is literally just a Markdown document you can read, edit, fork, and share. An MCP server is a GitHub repository you can clone, modify, and learn from. They’re artifacts that exist independently of any particular AI system or company.

This difference matters. The GPT Store is an app store, and however rich it becomes, an app store remains a walled garden. The iOS App Store and Google Play store host millions of apps for phones, but you can’t view source on an app, can’t extract the UI pattern you liked, and can’t fork it to fix a bug the developer won’t address. The open source revolution comes from artifacts you can inspect, modify, and share: source code, markup languages, configuration files, scripts. These are all things that are legible not just to computers but to humans who want to learn and build.

That’s the lineage skills and MCP belong to. They’re not apps; they’re components. They’re not products; they’re materials. The difference is architectural, and it shapes what kind of ecosystem can grow around them.

Nothing prevents OpenAI from making GPTs more inspectable and forkable, and nothing prevents skills or MCP from becoming more opaque and packaged. The tools are young. But the initial design choices reveal different instincts about what kind of participation matters. OpenAI seems deeply rooted in the proprietary platform model. Anthropic seems to be reaching for something more open.1

Complexity and Evolution

Of course, the web didn’t stay simple. HTML begat CSS, which begat JavaScript frameworks. View source becomes less useful when a page is generated by megabytes of minified React.

But the participatory architecture remained. The ecosystem became more complex, but it did so in layers, and you can still participate at whatever layer matches your needs and abilities. You can write vanilla HTML, or use Tailwind, or build a complex Next.js app. There are different layers for different needs, but all are composable, all shareable.

I suspect we’ll see a similar evolution with MCP and skills. Right now, they’re beautifully simple. They’re almost naive in their directness. That won’t last. We’ll see:

  • Abstraction layers: Higher-level frameworks that make common patterns easier.
  • Composition patterns: Skills that combine other skills, MCP servers that orchestrate other servers.
  • Optimization: When response time matters, you might need more sophisticated implementations.
  • Security and safety layers: As these tools handle sensitive data and actions, we’ll need better isolation and permission models.

The question is whether this evolution will preserve the architecture of participation or whether it will collapse into something that only specialists can work with. Given that Claude itself is very good at helping users write and modify skills, I suspect that we are about to experience an entirely new frontier of learning from open source, one that will keep skill creation open to all even as the range of possibilities expands.

What Does This Mean for Open Source AI?

Open weights are necessary but not sufficient. Yes, we need models whose parameters aren’t locked behind APIs. But model weights are like processor instructions. They are important but not where the most innovation will happen.

The real action is at the interface layer. MCP and skills open up new possibilities because they create a stable, comprehensible interface between AI capabilities and specific uses. This is where most developers will actually participate. Not only that, it’s where people who are not now developers will participate, as AI further democratizes programming. At bottom, programming is not the use of some particular set of “programming languages.” It is the skill set that starts with understanding a problem that the current state of digital technology can solve, imagining possible solutions, and then effectively explaining to a set of digital tools what we want them to help us do. The fact that this may now be possible in plain language rather than a specialized dialect means that more people can create useful solutions to the specific problems they face rather than looking only for solutions to problems shared by millions. This has always been a sweet spot for open source. I’m sure many people have said this about the driving impulse of open source, but I first heard it from Eric Allman, the creator of Sendmail, at what became known as the open source summit in 1998: “scratching your own itch.” And of course, history teaches us that this creative ferment often leads to solutions that are indeed useful to millions. Amateur programmers become professionals, enthusiasts become entrepreneurs, and before long, the entire industry has been lifted to a new level.

Standards enable participation. MCP is a protocol that works across different AI systems. If it succeeds, it won’t be because Anthropic mandates it but because it creates enough value that others adopt it. That’s the hallmark of a real standard.

Ecosystems beat models. The most generative platforms are those in which the platform creators are themselves part of the ecosystem. There isn’t an AI “operating system” platform yet, but the winner-takes-most race for AI supremacy is based on that prize. Open source and the internet provide an alternate, standards-based platform that not only allows people to build apps but to extend the platform itself.

Open source AI means rethinking open source licenses. Most of the software shared on GitHub has no explicit license, which means that default copyright laws apply: The software is under exclusive copyright, and the creator retains all rights. Others generally have no right to reproduce, distribute, or create derivative works from the code, even if it is publicly visible on GitHub. But as Shakespeare wrote in The Merchant of Venice, “The brain may devise laws for the blood, but a hot temper leaps o’er a cold decree.” Much of this code is de facto open source, even if not de jure. People can learn from it, easily copy from it, and share what they’ve learned.

But perhaps more importantly for the current moment in AI, it was all used to train LLMs, which means that this de facto open source code became a vector through which all AI-generated code is created today. This, of course, has made many developers unhappy, because they believe that AI has been trained on their code without either recognition or recompense. For open source, recognition has always been a fundamental currency. For open source AI to mean something, we need new approaches to recognizing contributions at every level.

Licensing issues also come up around what happens to data that flows through an MCP server. What happens when people connect their databases and proprietary data flows through an MCP so that an LLM can reason about it? Right now I suppose it falls under the terms you already have with the LLM vendor itself, but will that always be true? And would I, as a provider of information, want to restrict the use of an MCP server depending on a specific configuration of a user’s LLM settings? For example, might I be OK with them using a tool if they have turned off “sharing” in the free version, but not want them to use it if they hadn’t? As one commenter on a draft of this essay put it, “Some API providers would like to prevent LLMs from learning from data even if users permit it. Who owns the users’ data (emails, docs) after it has been retrieved via a particular API or MCP server might be a complicated issue with a chilling effect on innovation.”

There are efforts such as RSL (Really Simple Licensing) and CC Signals that are focused on content licensing protocols for the consumer/open web, but they don’t yet really have a model for MCP, or more generally for transformative use of content by AI. For example, if an AI uses my credentials to retrieve academic papers and produces a literature review, what encumbrances apply to the results? There is a lot of work to be done here.

Open Source Must Evolve as Programming Itself Evolves

It’s easy to be amazed by the magic of vibe coding. But treating the LLM as a code generator that takes input in English or other human languages and produces Python, TypeScript, or Java echoes the use of a traditional compiler or interpreter to generate byte code. It reads what we call a “higher-level language” and translates it into code that operates further down the stack. And there’s a historical lesson in that analogy. In the early days of compilers, programmers had to inspect and debug the generated assembly code, but eventually the tools got good enough that few people need to do that any more. (In my own career, when I was writing the manual for Lightspeed C, the first C compiler for the Mac, I remember Mike Kahl, its creator, hand-tuning the compiler output as he was developing it.)

Now programmers are increasingly finding themselves having to debug the higher-level code generated by LLMs. But I’m confident that will become a smaller and smaller part of the programmer’s role. Why? Because eventually we come to depend on well-tested components. I remember how the original Macintosh user interface guidelines, with predefined user interface components, standardized frontend programming for the GUI era, and how the Win32 API meant that programmers no longer needed to write their own device drivers. In my own career, I remember working on a book about curses, the Unix cursor-manipulation library for CRT screens, and a few years later the manuals for Xlib, the low-level programming interfaces for the X Window System. This kind of programming soon was superseded by user interface toolkits with predefined elements and actions. So too, the roll-your-own era of web interfaces was eventually standardized by powerful frontend JavaScript frameworks.

Once developers come to rely on libraries of preexisting components that can be combined in new ways, what developers are debugging is no longer the lower-level code (first machine code, then assembly code, then hand-built interfaces) but the architecture of the systems they build, the connections between the components, the integrity of the data they rely on, and the quality of the user interface. In short, developers move up the stack.

LLMs and AI agents are calling for us to move up once again. We are groping our way towards a new paradigm in which we are not just building MCPs as instructions for AI agents but developing new programming paradigms that blend the rigor and predictability of traditional programming with the knowledge and flexibility of AI. As Phillip Carter memorably noted, LLMs are inverted computers relative to those with which we’ve been familiar: “We’ve spent decades working with computers that are incredible at precision tasks but need to be painstakingly programmed for anything remotely fuzzy. Now we have computers that are adept at fuzzy tasks but need special handling for precision work.” That being said, LLMs are becoming increasingly adept at knowing what they are good at and what they aren’t. Part of the whole point of MCP and skills is to give them clarity about how to use the tools of traditional computing to achieve their fuzzy aims.

Consider the evolution of agents from those based on “browser use” (that is, working with the interfaces designed for humans) to those based on making API calls (that is, working with the interfaces designed for traditional programs) to those based on MCP (relying on the intelligence of LLMs to read documents that explain the tools that are available to do a task). An MCP server looks a lot like the formalization of prompt and context engineering into components. A look at what purports to be a leaked system prompt for ChatGPT suggests that the pattern of MCP servers was already hidden in the prompts of proprietary AI apps: “Here’s how I want you to act. Here are the things that you should and should not do. Here are the tools available to you.”

But while system prompts are bespoke, MCP and skills are a step towards formalizing plain text instructions to an LLM so that they can become reusable components. In short, MCP and skills are early steps towards a system of what we can call “fuzzy function calls.”

Fuzzy Function Calls: Magic Words Made Reliable and Reusable

This view of how prompting and context engineering fit with traditional programming connects to something I wrote about recently: LLMs natively understand high-level concepts like “plan,” “test,” and “deploy”; industry standard terms like “TDD” (Test Driven Development) or “PRD” (Product Requirements Document); competitive features like “study mode”; or specific file formats like “.md file.” These “magic words” are prompting shortcuts that bring in dense clusters of context and trigger particular patterns of behavior that have specific use cases.

But right now, these magic words are unmodifiable. They exist in the model’s training, within system prompts, or locked inside proprietary features. You can use them if you know about them, and you can write prompts to modify how they work in your current session. But you can’t inspect them to understand exactly what they do, you can’t tweak them for your needs, and you can’t share your improved version with others.

Skills and MCPs are a way to make magic words visible and extensible. They formalize the instructions and patterns that make an LLM application work, and they make those instructions something you can read, modify, and share.
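
One way to picture such a “fuzzy function call”: a plain-language instruction file kept in version control and invoked like a function. Here’s a toy sketch using the Anthropic Python SDK, where the skill path, model name, and prompt are placeholders rather than a prescribed pattern.

from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def fuzzy_call(skill_path: str, user_input: str) -> str:
    """Treat a plain-language skill file as a reusable, version-controlled 'function body'."""
    instructions = Path(skill_path).read_text()  # e.g. a SKILL.md kept in Git
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        system=instructions,        # the skill supplies the behavior
        messages=[{"role": "user", "content": user_input}],
    )
    return message.content[0].text

print(fuzzy_call("skills/literature-review/SKILL.md", "Summarize these three abstracts..."))

Swap in a different skill file and the “function” behaves differently, which is exactly the property that makes these artifacts shareable and forkable.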

Take ChatGPT’s study mode as an example. It’s a particular way of helping someone learn, by asking comprehension questions, testing understanding, and adjusting difficulty based on responses. That’s incredibly valuable. But it’s locked inside ChatGPT’s interface. You can’t even access it via the ChatGPT API. What if study mode was published as a skill? Then you could:

  • See exactly how it works. What instructions guide the interaction?
  • Modify it for your subject matter. Maybe study mode for medical students needs different patterns than study mode for language learning.
  • Fork it into variants. You might want a “Socratic mode” or “test prep mode” that builds on the same foundation.
  • Use it with your own content and tools. You might combine it with an MCP server that accesses your course materials.
  • Share your improved version and learn from others’ modifications.

This is the next level of AI programming “up the stack.” You’re not training models or vibe coding Python. You’re elaborating on concepts the model already understands, more adapted to specific needs, and sharing them as building blocks others can use.

Building reusable libraries of fuzzy functions is the future of open source AI.

The Economics of Participation

There’s a deeper pattern here that connects to a rich tradition in economics: mechanism design. Over the past few decades, economists like Paul Milgrom and Al Roth won Nobel Prizes for showing how to design better markets: matching systems for medical residents, spectrum auctions for wireless licenses, kidney exchange networks that save lives. These weren’t just theoretical exercises. They were practical interventions that created more efficient, more equitable outcomes by changing the rules of the game.

Some tech companies understood this. As chief economist at Google, Hal Varian didn’t just analyze ad markets, he helped design the ad auction that made Google’s business model work. At Uber, Jonathan Hall applied mechanism design insights to dynamic pricing and marketplace matching to build a “thick market” of passengers and drivers. These economists brought economic theory to bear on platform design, creating systems where value could flow more efficiently between participants.

Though not guided by economists, the web and the open source software revolution were also not just technical advances but breakthroughs in market design. They created information-rich, participatory markets where barriers to entry were lowered. It became easier to learn, create, and innovate. Transaction costs plummeted. Sharing code or content went from expensive (physical distribution, licensing negotiations) to nearly free. Discovery mechanisms emerged: Search engines, package managers, and GitHub made it easy to find what you needed. Reputation systems were discovered or developed. And of course, network effects benefited everyone. Each new participant made the ecosystem more valuable.

These weren’t accidents. They were the result of architectural choices that made internet-enabled software development into a generative, participatory market.

AI desperately needs similar breakthroughs in mechanism design. Right now, most economic analysis of AI focuses on the wrong question: “How many jobs will AI destroy?” This is the mindset of an extractive system, where AI is something done to workers and to existing companies rather than with them. The right question is: “How do we design AI systems that create participatory markets where value can flow to all contributors?”

Consider what’s broken right now:

  • Attribution is invisible. When an AI model benefits from training on someone’s work, there’s no mechanism to recognize or compensate for that contribution.
  • Value capture is concentrated. A handful of companies capture the gains, while millions of content creators, whose work trained the models and are consulted during inference, see no return.
  • Improvement loops are closed. If you find a better way to accomplish a task with AI, you can’t easily share that improvement or benefit from others’ discoveries.
  • Quality signals are weak. There’s no good way to know if a particular skill, prompt, or MCP server is well-designed without trying it yourself.

MCP and skills, viewed through this economic lens, are early-stage infrastructure for a participatory AI market. The MCP Registry and skills gallery are primitive but promising marketplaces with discoverable components and inspectable quality. When a skill or MCP server is useful, it’s a legible, shareable artifact that can carry attribution. While this may not redress the “original sin” of copyright violation during model training, it does perhaps point to a future where content creators, not just AI model creators and app developers, may be able to monetize their work.

But we’re nowhere near having the mechanisms we need. We need systems that efficiently match AI capabilities with human needs, that create sustainable compensation for contribution, that enable reputation and discovery, that make it easy to build on others’ work while giving them credit.

This isn’t just a technical challenge. It’s a challenge for economists, policymakers, and platform designers to work together on mechanism design. The architecture of participation isn’t just a set of values. It’s a powerful framework for building markets that work. The question is whether we’ll apply these lessons of open source and the web to AI or whether we’ll let AI become an extractive system that destroys more value than it creates.

A Call to Action

I’d love to see OpenAI, Google, Meta, and the open source community develop a robust architecture of participation for AI.

Make innovations inspectable. When you build a compelling feature or an effective interaction pattern or a useful specialization, consider publishing it in a form others can learn from. Not as a closed app or an API to a black box but as instructions, prompts, and tool configurations that can be read and understood. Sometimes competitive advantage comes from what you share rather than what you keep secret.

Support open protocols. MCP’s early success demonstrates what’s possible when the industry rallies around an open standard. Since Anthropic introduced it in late 2024, MCP has been adopted by OpenAI (across ChatGPT, the Agents SDK, and the Responses API), Google (in the Gemini SDK), Microsoft (in Azure AI services), and a rapidly growing ecosystem of development tools from Replit to Sourcegraph. This cross-platform adoption proves that when a protocol solves real problems and remains truly open, companies will embrace it even when it comes from a competitor. The challenge now is to maintain that openness as the protocol matures.

Create pathways for contribution at every level. Not everyone needs to fork model weights or even write MCP servers. Some people should be able to contribute a clever prompt template. Others might write a skill that combines existing tools in a new way. Still others will build infrastructure that makes all of this easier. All of these contributions should be possible, visible, and valued.

Document magic. When your model responds particularly well to certain instructions, patterns, or concepts, make those patterns explicit and shareable. The collective knowledge of how to work effectively with AI shouldn’t be scattered across X threads and Discord channels. It should be formalized, versioned, and forkable.

Reinvent open source licenses. Take into account the need for recognition not only during training but inference. Develop protocols that help manage rights for data that flows through networks of AI agents.

Engage with mechanism design. Building a participatory AI market isn’t just a technical problem, it’s an economic design challenge. We need economists, policymakers, and platform designers collaborating on how to create sustainable, participatory markets around AI. Stop asking “How many jobs will AI destroy?” and start asking “How do we design AI systems that create value for all participants?” The architecture choices we make now will determine whether AI becomes an extractive force or an engine of broadly shared prosperity.

The future of programming with AI won’t be determined by who publishes model weights. It’ll be determined by who creates the best ways for ordinary developers to participate, contribute, and build on each other’s work. And that includes the next wave of developers: users who can create reusable AI skills based on their special knowledge, experience, and human perspectives.

We’re at a choice point. We can make AI development look like app stores and proprietary platforms, or we can make it look like the open web and the open source lineages that descended from Unix. I know which future I’d like to live in.


Footnotes

  1. I shared a draft of this piece with members of the Anthropic MCP and Skills team, and in addition to providing a number of helpful technical improvements, they confirmed a number of points where my framing captured their intentions. Comments ranged from “Skills were designed with composability in mind. We didn’t want to confine capable models to a single system prompt with limited functions” to “I love this phrasing since it leads into considering the models as the processing power, and showcases the need for the open ecosystem on top of the raw power a model provides” and “In a recent talk, I compared the models to processors, agent runtimes/orchestrations to the OS, and Skills as the application.” However, all of the opinions are my own and Anthropic is not responsible for anything I’ve said here.

Job for 2027: Senior Director of Million-Dollar Regexes

24 November 2025 at 07:04
The following article originally appeared on Medium and is being republished here with the author’s permission.

Don’t get me wrong, I’m up all night using these tools.

But I also sense we’re heading for an expensive hangover. The other day, a colleague told me about a new proposal to route a million documents a day through a system that identifies and removes Social Security numbers.

I joked that this was going to be a “million-dollar regular expression.”

Run the math on the “naïve” implementation with full GPT-5 and it’s eye-watering: A million messages a day at ~50K characters each works out to around 12.5 billion tokens daily, or $15,000 a day at current pricing. That’s nearly $6 million a year to check for Social Security numbers. Even if you migrate to GPT-5 Nano, you still spend about $230,000 a year.
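
The back-of-the-envelope version, assuming roughly four characters per token and input prices in the neighborhood of $1.25 per million tokens for GPT-5 and $0.05 for GPT-5 Nano (both assumptions based on published list pricing, which will drift):

docs_per_day = 1_000_000
chars_per_doc = 50_000
chars_per_token = 4  # rough rule of thumb

tokens_per_day = docs_per_day * chars_per_doc / chars_per_token  # ~12.5 billion

price_per_million = {"gpt-5": 1.25, "gpt-5-nano": 0.05}  # assumed $ per million input tokens
for model, price in price_per_million.items():
    daily = tokens_per_day / 1_000_000 * price
    print(f"{model}: about ${daily:,.0f}/day, ${daily * 365:,.0f}/year")

# gpt-5: about $15,625/day, $5,703,125/year
# gpt-5-nano: about $625/day, $228,125/year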

That’s a success. You “saved” $5.77 million a year…

How about running this code for a million documents a day? How much would this cost:

import re; s = re.sub(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b", "[REDACTED]", s)

A plain old EC2 instance could handle this. Something like an m1.small at 30 bucks a month could churn through the same workload with a regex and cost you a few hundred dollars a year.

Which means that in practice, companies will be calling people like me in a year saying, “We’re burning a million dollars to do something that should cost a fraction of that—can you fix it?”

From $15,000/day to $0.96/day—I do think we’re about to see a lot of companies realize that a thinking model connected to an MCP server is way more expensive than just paying someone to write a bash script. Starting now, you’ll be able to make a career out of un-LLM-ifying applications.
